Preconditioning is a technique from numerical linear algebra that
can accelerate algorithms to solve systems of equations. In this paper,
we demonstrate how preconditioning can circumvent a stringent
assumption for sign consistency in sparse linear regression. Given
$X \in \mathbb{R}^{n \times p}$ and $Y \in \mathbb{R}^{n}$ that satisfy the standard regression equation,
this paper demonstrates that even if the design matrix X does not
satisfy the irrepresentable condition for the Lasso, the design matrix
FX often does, where $F \in \mathbb{R}^{n \times n}$ is a preconditioning matrix defined
in this paper. By computing the Lasso on (FX, FY), instead of on
(X, Y), the necessary assumptions on X become much less stringent.

Our preconditioner F ensures that the singular values of the design
matrix are either zero or one. When n > p, the columns of FX
are orthogonal and the preconditioner always circumvents the stringent
assumptions. When p > n, F projects the design matrix onto
the Stiefel manifold; the rows of FX are orthogonal. We give both
theoretical results and simulation results to show that, in the high
dimensional case, the preconditioner helps to circumvent the stringent
assumptions, improving the statistical performance of a broad class
of model selection techniques in linear regression. Simulation results
are particularly promising.

1. Introduction. Recent breakthroughs in information technology have provided new 
experimental capabilities in astronomy, biology, chemistry, neuroscience, and several other 
disciplines. Many of these new measurement devices create data sets with many more
"measurements" than units of observation. For example, due to experimental constraints, both
fMRI and microarray experiments often include tens or hundreds of people. However, the
fMRI and microarray technologies can simultaneously measure 10's or 100's of thousands
of different pieces of information for each individual. Classical statistical inference in such
"high-dimensional" or $p \gg n$ regimes is often impossible. Successful experiments must rely
on some type of sparsity or low-dimensional structure. Several statistical techniques have 
been developed to exploit various types of structure, including sparse high dimensional 
regression. 

Sparse high dimensional regression aims to select the few measurements (among the 
10's of thousands) that relate to an outcome of interest; these techniques can screen out 
the irrelevant variables. A rich theoretical literature describes the consistency of various 
sparse high dimensional regression techniques, highlighting several potential pitfalls (e.g. 
Knight and Fu [2000], Fan and Li [2001], Greenshtein and Ritov [2004], Donoho et al.
[2006], Meinshausen and Bühlmann [2006], Tropp [2006], Zhao and Yu [2006], Zou [2006],
Zhang and Huang [2008], Fan and Lv [2008], Wainwright [2009], Meinshausen and Yu 
[2009], Bickel et al. [2009], Zhang [2010], Shao and Deng [2012]). In this literature, one of 
the most popular measures of asymptotic performance is sign consistency, which implies 
that the estimator selects the correct set of predictors asymptotically. One of the most 
popular methods in sparse regression, the Lasso (defined in Section 1.1), requires a stringent 
"irrepresentable condition" to achieve sign consistency [Tibshirani, 1996, Zhao and Yu, 
2006]. The irrepresentable condition restricts the correlation between the columns of the 
design matrix in a way made explicit in Section 1.1. 

It is well known that the Ordinary Least Squares (OLS) estimator performs poorly 
when the columns of the design matrix are highly correlated. However, this problem can 
be overcome by more samples; OLS is still consistent. With the Lasso, the detrimental 
effects of correlation are more severe. If the columns of the design matrix are correlated in 
a way that violates the irrepresentable condition, then the Lasso will not be sign consistent 
(i.e. statistical estimation will not improve with more samples). 

To avoid the irrepresentable condition, several researchers have proposed alternative 
penalized least square methods that use a different penalty from the Lasso penalty. For 
example, Fan and Li [2001] propose SCAD, a concave penalty function; Zhang [2010] proposes
the minimax concave penalty (MCP), another concave penalty function, and gives
high probability results for PLUS, an optimization algorithm. Unfortunately, these concave
penalties lead to nonconvex optimization problems. Although there are algorithmic approximations
for these problems and some high probability results [Zhang, 2010], the estimator
that these algorithms compute is not necessarily the estimator that optimizes the penalized
least squares objective. The Adaptive Lasso provides another alternative penalty which is
data adaptive and heterogeneous [Zou, 2006]. Unfortunately, its statistical performance
degrades in high dimensions.

In penalized least squares, there is both a penalty and a data fidelity term (which makes 
the estimator conform to the data). The papers cited in the previous paragraph adjust 
the type of sparse penalty. In this paper, we precondition the data, which is equivalent to 
adjusting the data fidelity term. Other researchers have previously proposed alternative 
ways of measuring data fidelity [Van De Geer, 2008], but their alternatives are meant
to accommodate different error distributions, not to avoid the irrepresentable condition.
Similar to the work presented here, Xiong et al. [2011] also propose adjusting the data fidelity
term to avoid the irrepresentable condition. They propose a procedure which (1) makes
the design matrix orthogonal by adding rows, and (2) applies an EM algorithm, with 
SCAD, to estimate the outcomes corresponding to the additional rows in the design matrix. 
Although this algorithm performs well in the low dimensional case, it is computationally 
expensive in high dimensional problems. The procedure proposed in this paper adjusts the 
data fidelity term by preconditioning, a preprocessing step. Relative to these alternative 
methods, preconditioning is easier to implement, requiring only a couple lines of code 
before calling any standard Lasso package. Furthermore, this type of preprocessing is widely 
studied in a related field, numerical linear algebra. 

Preconditioning describes a popular suite of techniques in numerical linear algebra that 
stabilize and accelerate algorithms to solve systems of equations (e.g. Axelsson [1985], 
Golub and Van Loan [1996]). In a system of equations, one seeks the vector x that satisfies
Ax = b, where $A \in \mathbb{R}^{n \times n}$ is a given matrix and $b \in \mathbb{R}^{n}$ is a given vector. The speed of most
solvers is inversely proportional to the condition number of the matrix A, the ratio of its
largest eigenvalue over its smallest eigenvalue. When the matrix A has both large eigenvalues
and small eigenvalues, the system Ax = b is "ill-conditioned." For example, if the
matrix A has highly correlated columns, it will be ill-conditioned. One can "precondition" 
the problem by left multiplying the system by a matrix T, TAx = Tb; the preconditioner 
T is designed to shrink the condition number of A thereby accelerating the system solver. 
If the columns of A are highly correlated, then preconditioning decorrelates the columns. 
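
As a small illustration of the idea, the following sketch preconditions a toy system. The matrix A, the vector b, and the row-scaling preconditioner T are hypothetical choices made only for this example; they are not the preconditioner studied in this paper. The sketch assumes NumPy, whose `np.linalg.cond` reports the ratio of the largest to the smallest singular value.

```python
import numpy as np

# Hypothetical 2x2 system whose rows live on very different scales.
A = np.array([[1.0e4, 2.0e3],
              [0.3,   1.0]])
b = np.array([1.0, 2.0])

# Illustrative preconditioner T: rescale every row of A to unit length.
T = np.diag(1.0 / np.linalg.norm(A, axis=1))

# Condition number (largest over smallest singular value) before and after.
cond_before = np.linalg.cond(A)
cond_after = np.linalg.cond(T @ A)

# Left multiplying by an invertible T does not change the solution x.
x_original = np.linalg.solve(A, b)
x_preconditioned = np.linalg.solve(T @ A, T @ b)

print(f"condition number before: {cond_before:.1f}, after: {cond_after:.2f}")
```

The preconditioned system TAx = Tb has the same solution as Ax = b, but a much smaller condition number in this toy case.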

The system of equations Ax = b has many similarities with the linear regression equation

(1) $Y = X\beta^* + \epsilon$,

where we observe $Y \in \mathbb{R}^{n}$ and $X \in \mathbb{R}^{n \times p}$, and $\epsilon \in \mathbb{R}^{n}$ contains unobserved iid noise
terms with $E(\epsilon) = 0$ and $\mathrm{var}(\epsilon) = \sigma^2 I_n$. From the system of equations Ax = b, the
regression equation adds an error term and allows for the design matrix to be rectangular. 
Where numerical linear algebraists use preconditioners for algorithmic speed, this paper 
shows that preconditioning can circumvent the irrepresentable condition, improving the 
statistical performance of the Lasso. 

Just as preconditioning sidesteps the difficulties presented by correlation in systems of 
equations, preconditioning can sidestep the difficulties in sparse linear regression. Numerical 
algebraists precondition systems of linear equations to make algorithms faster and more 
stable. In this paper, we show that preconditioning the regression equation (Equation 1) 
can circumvent the irrepresentable condition. For a design matrix $X \in \mathbb{R}^{n \times p}$, we study a
specific preconditioner $F \in \mathbb{R}^{n \times n}$ that is defined from the singular value decomposition
$X = UDV'$. We call $F = UD^{-1}U'$ the Puffer Transformation because it inflates the
smallest nonzero singular values of the design matrix (Section 2.1 discusses this in more detail).
This paper demonstrates why the matrix FX can satisfy the irrepresentable condition,
while the matrix X may not; in essence, the preconditioner makes the columns of X less
correlated. When n > p, the columns of FX are exactly orthogonal, trivially satisfying
the irrepresentable condition. When n < p and the columns of X are moderately or highly
correlated, F can greatly reduce the pairwise correlations between columns, making
the design matrix more amenable to the irrepresentable condition.

In a paper titled "Preconditioning" for feature selection and regression in high-dimensional
problems, the authors propose projecting the outcome Y onto the top singular vectors of X
before running the Lasso [Paul et al., 2008]. They leave X unchanged. The current paper
preconditions the entire regression equation, which reduces the impact of the top singular
vectors of X, and thus reduces the correlation between the columns of X. Whereas the
preprocessing step in Paul et al. [2008] performs noise reduction, the Puffer Transformation
makes the design matrix conform to the irrepresentable condition.

The outline of the paper is as follows: Section 1.1 gives the necessary mathematical 
notation and definitions. Section 2 introduces the Puffer Transformation and gives a
geometrical interpretation of the transformation. Section 3 discusses the low dimensional
setting (p < n), where the Puffer Transformation makes the columns of FX orthogonal.
Theorem 1 gives sufficient conditions for the sign consistency of the preconditioned Lasso
when p < n; these sufficient conditions do not include an irrepresentable condition. Section
4 discusses the high dimensional setting (p > n), where the Puffer Transformation projects
the design matrix onto the Stiefel manifold. Theorem 2 shows that most matrices on the
Stiefel manifold satisfy the irrepresentable condition. Theorem 3 gives sufficient conditions
for the sign consistency of the preconditioned Lasso in high dimensions. This theorem
includes an irrepresentable condition. Section 5 shows promising simulations that compare
the preconditioned Lasso to several other (un-preconditioned) methods. Section 6 describes
four data analysis techniques that incidentally precondition the design matrix with a
potentially harmful preconditioner; just as a good preconditioner can improve estimation
performance, a bad preconditioner can severely detract from performance. Users should be 
cautious when using the four techniques described in Section 6. Section 7 concludes the 
paper. 

1.1. Preliminaries. To define the Lasso estimator, suppose the observed data are independent
pairs $\{(x_i, Y_i)\} \in \mathbb{R}^{p} \times \mathbb{R}$ for $i = 1, 2, \ldots, n$ following the linear regression model

(2) $Y_i = x_i^T \beta^* + \epsilon_i$,

where $x_i^T$ is a row vector representing the predictors for the $i$th observation, $Y_i$ is the
corresponding $i$th response variable, and the $\epsilon_i$'s are independent, mean zero noise terms with
variance $\sigma^2$. The unobserved coefficients are $\beta^* \in \mathbb{R}^{p}$. Use $X \in \mathbb{R}^{n \times p}$ to denote the $n \times p$
design matrix with $x_k^T = (X_{k1}, \ldots, X_{kp})$ as its $k$th row and with $X_j = (X_{1j}, \ldots, X_{nj})^T$ as
its $j$th column, then

$$X = (x_1, x_2, \ldots, x_n)^T = (X_1, X_2, \ldots, X_p).$$

Let $Y = (Y_1, \ldots, Y_n)^T$ and $\epsilon = (\epsilon_1, \epsilon_2, \ldots, \epsilon_n)^T \in \mathbb{R}^{n}$.

For penalty function $\mathrm{pen}(b, \lambda) : \mathbb{R}^{p} \to \mathbb{R}$, define the penalized least squares objective
function,

(3) $\ell(b, \mathrm{pen}, \lambda) = \frac{1}{2} \|Y - Xb\|_2^2 + \mathrm{pen}(b, \lambda)$.

The Lasso estimator uses the $\ell_1$ penalty,

(4) $\hat\beta(\lambda) = \arg\min_{b} \frac{1}{2} \|Y - Xb\|_2^2 + \lambda \|b\|_1$,

where for some vector $b \in \mathbb{R}^{p}$, $\|b\|_r = \left( \sum_{i=1}^{p} |b_i|^r \right)^{1/r}$.

The popularity of the Lasso (and other sparse penalized least squares methods) stems 
from the fact that, for large enough values of A, the estimated coefficient vectors contain 
several zeros. If one is willing to assume the linear regression model, then the Lasso
estimates which columns in X are conditionally independent of Y given the other columns in
X.

1.1.1. Sign consistency and the irrepresentable condition. For $T \subset \{1, \ldots, p\}$ with $|T| = t$,
define $X(T) \in \mathbb{R}^{n \times t}$ to contain the columns of X indexed by T. For any vector $x \in \mathbb{R}^{p}$,
define $x(T) = (x_j)_{j \in T}$. The support of $\beta^*$, $S \subset \{1, \ldots, p\}$, is defined as

$$S = \{j : \beta_j^* \neq 0\}.$$

Define $s = |S|$. In order to define sign consistency, define

$$\mathrm{sign}(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x = 0 \\ -1 & \text{if } x < 0, \end{cases}$$

and for a vector b, sign(b) is defined as a vector with the $i$th element $\mathrm{sign}(b_i)$.

Definition 1. The Lasso is sign consistent if there exists a sequence $\lambda_n$ such that

$$P\left( \mathrm{sign}(\hat\beta(\lambda_n)) = \mathrm{sign}(\beta^*) \right) \to 1, \text{ as } n \to \infty.$$

In other words, $\hat\beta(\lambda)$ can asymptotically identify the relevant and irrelevant variables
when it is sign consistent. Several authors, including Meinshausen and Bühlmann [2006],
Zou [2006], Zhao and Yu [2006], Yuan and Lin [2007], have studied the sign consistency
property and found a sufficient condition for sign consistency. Zhao and Yu [2006] called
this assumption the "irrepresentable condition" and showed that it is almost necessary for
sign consistency.

Definition 2. The design matrix X satisfies the irrepresentable condition for $\beta^*$
if, for some constant $\eta \in (0, 1]$,

(5) $\left\| X(S^c)^T X(S) \left( X(S)^T X(S) \right)^{-1} \mathrm{sign}(\beta^*(S)) \right\|_\infty \le 1 - \eta$,

where for a vector x, $\|x\|_\infty = \max_i |x_i|$.

In practice, this condition is difficult to check because it relies on the unknown set S.
Section 2 of Zhao and Yu [2006] gives several sufficient conditions. For example, their
Corollary 2 shows that if $|\mathrm{cor}(X_i, X_j)| \le c/(2s - 1)$ for a constant $0 \le c < 1$, then
the irrepresentable condition holds. Theorem 2 in Section 4 of this paper relies on their
corollary.
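
When the support S and the signs of $\beta^*(S)$ are known, as in a simulation, the left hand side of Equation (5) can be evaluated directly. The sketch below assumes NumPy; the helper name `irrepresentable_lhs` and the toy Gram matrix (two uncorrelated relevant columns and one irrelevant column with correlation r = 0.6 to each of them) are hypothetical choices for illustration only.

```python
import numpy as np

def irrepresentable_lhs(X, support, signs):
    """Sup-norm of X(S^c)' X(S) (X(S)' X(S))^{-1} sign(beta*(S)), as in Equation (5)."""
    p = X.shape[1]
    S = np.asarray(support)
    Sc = np.setdiff1d(np.arange(p), S)          # indices of irrelevant columns
    XS, XSc = X[:, S], X[:, Sc]
    # Solve (X(S)' X(S)) w = sign(beta*(S)) instead of forming the inverse.
    w = np.linalg.solve(XS.T @ XS, np.asarray(signs, dtype=float))
    return np.max(np.abs(XSc.T @ XS @ w))

# Toy design with a prescribed Gram matrix X'X = G.
r = 0.6
G = np.array([[1.0, 0.0, r],
              [0.0, 1.0, r],
              [r,   r,   1.0]])
X_corr = np.linalg.cholesky(G).T                # X_corr' X_corr = G

lhs_corr = irrepresentable_lhs(X_corr, support=[0, 1], signs=[1, 1])
lhs_orth = irrepresentable_lhs(np.eye(3), support=[0, 1], signs=[1, 1])
print(lhs_corr, lhs_orth)
```

In this toy design the quantity equals 2r = 1.2, which exceeds one, so no $\eta \in (0, 1]$ satisfies Equation (5); for the orthonormal design it is zero.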

2. Preconditioning to circumvent the stringent assumption. We will always
assume that the design matrix $X \in \mathbb{R}^{n \times p}$ has rank $d = \min\{n, p\}$. From the singular value
decomposition, there exist matrices $U \in \mathbb{R}^{n \times d}$ and $V \in \mathbb{R}^{p \times d}$ with $U^T U = V^T V = I_d$ and
diagonal matrix $D \in \mathbb{R}^{d \times d}$ such that $X = UDV^T$. Define the Puffer Transformation,

(6) $F = UD^{-1}U^T$.

The preconditioned design matrix FX has the same singular vectors as X. However, all of
the nonzero singular values of FX are set to unity: $FX = UV^T$. When n > p, the columns
of FX are orthonormal. When n < p, the rows of FX are orthonormal.
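
The definition in Equation (6) translates into a few lines of code. The following is a minimal sketch, assuming NumPy and a full rank X; the function name `puffer_transform` and the simulated Gaussian matrices are hypothetical and serve only to check the two orthonormality statements numerically.

```python
import numpy as np

def puffer_transform(X):
    """Return F = U D^{-1} U' from the thin SVD X = U D V' (X assumed full rank)."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(1.0 / d) @ U.T

rng = np.random.default_rng(0)
X_tall = rng.standard_normal((50, 5))   # n > p
X_wide = rng.standard_normal((5, 50))   # n < p

FX_tall = puffer_transform(X_tall) @ X_tall
FX_wide = puffer_transform(X_wide) @ X_wide

# Columns are orthonormal when n > p; rows are orthonormal when n < p.
print(np.allclose(FX_tall.T @ FX_tall, np.eye(5)))
print(np.allclose(FX_wide @ FX_wide.T, np.eye(5)))
# All nonzero singular values of FX equal one.
print(np.linalg.svd(FX_wide, compute_uv=False))
```

The preconditioned data (FX, FY) can then be passed to any standard Lasso routine in place of (X, Y).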

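Define $\tilde{Y} = FY$, $\tilde{X} = FX$, and $\tilde{\epsilon} = F\epsilon$. After left multiplying the regression equation
$Y = X\beta^* + \epsilon$ by the matrix F, the transformed regression equation becomes

(7) $\tilde{Y} = \tilde{X}\beta^* + \tilde{\epsilon}$.

If $\epsilon \sim N(0, \sigma^2 I_n)$, then $\tilde{\epsilon} \sim N(0, \tilde{\Sigma})$ where $\tilde{\Sigma} = \sigma^2 UD^{-2}U^T$.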

The scale of $\tilde{\Sigma}$ depends on the diagonal matrix D, which contains the d singular values
of X. As the singular values of X approach zero, the corresponding elements of $D^{-2}$ grow
very quickly. This increased noise can quickly overwhelm the benefits of a well conditioned
design matrix. For this reason, it might be necessary to add a Tikhonov regularization term to
the diagonal of D. The simulations in Section 5 show that when $p \approx n$, the transformation
can harm estimation. Future research will examine if a Tikhonov regularizer resolves this
issue.
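
The noise inflation can be made concrete by computing the covariance $\sigma^2 UD^{-2}U^T$ of the transformed noise for designs of different shapes. The sketch below assumes NumPy; the function name `puffer_noise_covariance`, the iid Gaussian designs, and the sizes n = 100 and p in {10, 90, 100} are hypothetical choices, not the simulation settings of Section 5.

```python
import numpy as np

def puffer_noise_covariance(X, sigma2=1.0):
    """cov(F eps) = sigma^2 U D^{-2} U' for eps ~ N(0, sigma^2 I_n)."""
    U, d, _ = np.linalg.svd(X, full_matrices=False)
    # Dividing column k of U by d_k^2 forms U D^{-2}.
    return sigma2 * (U / d**2) @ U.T

rng = np.random.default_rng(1)
n = 100
total_noise = {}
for p in (10, 90, 100):                 # p = 100 is the case p = n
    X = rng.standard_normal((n, p))
    # The trace equals sigma^2 times the sum of 1 / d_k^2.
    total_noise[p] = np.trace(puffer_noise_covariance(X))

print(total_noise)
```

In this sketch the total variance of the transformed noise grows as p approaches n, because the smallest singular values of X move toward zero.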

In numerical linear algebra, the objective is speed, and there is a trade off between 
the time spent computing the preconditioner vs. solving the system of equations. Better 
preconditioners make the original problem easier to solve. However, these preconditioners 
themselves can be time consuming to compute. Successful preconditioners balance these 
two costs to provide a computational advantage. In our setting, the objective is inference, 
not speed per se, and the tradeoff is between a well behaved design matrix and a well 
behaved error term. Preconditioning can aid statistical inference if it can balance these two 
constraints. 

2.1. Geometrical representation. The figures in this section display the geometry of 
the Lasso before and after the Puffer Transformation. These figures (a) demonstrate what 
happens when the irrepresentable condition is not satisfied, (b) reveal how the Puffer
Transformation circumvents the irrepresentable condition, and (c) illustrate why we call F 
the Puffer Transformation. 

The figures in this section are derived from the following optimization problem which is
equivalent to the Lasso.$^1$

(8) $\hat\beta(c) = \arg\min_{b : \|b\|_1 \le c} \|Y - Xb\|_2^2$

Given the constraint set $\|b\|_1 \le c$ and a continuum of sets $\|Y - Xb\|_2^2 \le x$ for $x \ge 0$,
define

$$I(c, x) = \{b : \|b\|_1 \le c\} \cap \{b : \|Y - Xb\|_2^2 \le x\}.$$

When c is small enough, $I(c, 0)$ is an empty set, implying that there is no b with $\|b\|_1 \le c$
such that Y = Xb. To find $\hat\beta(c)$, increase the value of x until $I(c, x)$ is no longer an empty
set. Let $x^*$ be the smallest x such that $I(c, x)$ is nonempty. Then, $\hat\beta(c) \in I(c, x^*)$. Under
certain conditions on X (e.g. full column rank), the solution is unique and $\hat\beta(c) = I(c, x^*)$.

Figures 1 and 2 below give a graphical representation of this description of the Lasso
before and after preconditioning. The constraint set $\{b : \|b\|_1 \le c\}$ appears as a diamond
shaped polyhedron and the level set of the loss function $\{b : \|Y - Xb\|_2^2 \le x\}$ appears as
an ellipse. Starting from x = 0, x increases, dilating the ellipse, until the ellipse intersects
the constraint set. The first point of intersection represents the solution to the Lasso. This
point $\hat\beta(c) \in \mathbb{R}^{p}$ is the element of the constraint set which minimizes $\|Y - Xb\|_2$.

In both Figure 1 and Figure 2, the rows of X are independent Gaussian vectors with 
mean zero and covariance matrix 




$^1$ In an abuse of notation, the left hand side of Equation 8 is $\hat\beta(c)$. In fact, there is a one-to-one function
$\phi(c) = \lambda$ to make the Lagrangian form of the Lasso (Equation 4) equivalent to the constrained form of the
Lasso (Equation 8).

Fig 1. When the problem is not preconditioned, the elongated ellipse (representing the $\ell_2$ loss) intersects the
$\ell_1$ ball off of the plane created from the axes that point to the left. The Lasso fails to select the true model.

To highlight the effects of preconditioning, the noise is very small and n = 10,000. In
Figure 1, the design matrix is not preconditioned. In Figure 2, the problem has been
preconditioned, and the ellipse represents the set $\|FY - FXb\|_2^2 \le x$ for some value of x.
The preconditioning turns the ellipse into a sphere.

In both Figure 1 and Figure 2, $\beta^* = (1, 1, 0)$. In both figures, the third dimension is
represented by the axis that points up and down. Thus, if the ellipse intersects the constraint
set in the (horizontal) plane formed by the first two dimensions, then the Lasso estimates
the correct sign. In Figure 1, the design matrix fails the irrepresentable condition and the
elongated shape of the ellipse forces $\hat\beta(c)$ off of the true plane. This is illustrated in the right
panel of Figure 1. High dimensional regression techniques that utilize concave penalties
avoid this geometrical misfortune by changing the shape of the constraint set. Where
the $\ell_1$ ball has a flat surface, the non-convex balls flex inward, dodging any unfortunate
protrusions of the ellipse. As preconditioning acts on the opposite set, it restricts the
protrusions in the ellipse.

In Figure 2, the elongated direction of the ellipse shrinks down, and the ellipse is puffed
out into a sphere; it then satisfies the irrepresentable condition, and $\hat\beta(\lambda)$ lies in the true
plane. Therefore, in this example, preconditioning circumvents the stringent condition for
sign consistency. When n > p preconditioning makes the ellipse a sphere. When p > n, 
preconditioning can drastically reduce the pairwise correlations between columns, thus 
making low dimensional projections of the ellipse more spherical. Both Figure 1 and Figure 
2 were drawn with the R library rgl. 

Fig 2. After preconditioning, the $\ell_2$ loss ellipse turns to a sphere. This ensures that it intersects the
polyhedron in the plane. After preconditioning, the Lasso correctly selects the true model.

Figure 2 illustrates why the Puffer Transformation is so named. We call F the Puffer
Transformation in reference to the pufferfish. In its relaxed state, a pufferfish looks similar
to most other fish, oval and oblong. However, when a pufferfish senses danger, it defends
itself by inflating its body into a sphere. In regression, if the columns of the design matrix
are correlated, then the contours of the loss function $\ell_2(b) = \|Y - Xb\|_2^2$ are oval and
oblong as illustrated in Figure 1. If a data analyst has reason to believe that the design
matrix might not satisfy the irrepresentable condition, then she could employ the Puffer
Transformation to "inflate" the smallest singular values, making the contours of $\ell_{2,F}(b) =
\|FY - FXb\|_2^2$ spherical. Although these contours are not faithful to the covariance structure
of the standard OLS estimator, the spherical contours are more suited to the geometry of
the Lasso in many settings.

3. Low dimensional Results. This section demonstrates that for n > p, after the
Puffer Transformation, the Lasso is sign consistent with a minimal assumption on the
design matrix X. When n > p, the preconditioned design matrix is orthonormal:

$$(FX)^T FX = VDU^T UD^{-1}U^T UD^{-1}U^T UDV^T = I.$$

The irrepresentable condition makes a requirement on

$$X(S^c)' X(S) \left( X(S)' X(S) \right)^{-1}.$$

Since the Puffer Transformation makes the columns of the design matrix orthonormal,
$(FX(S^c))' FX(S) = 0$, satisfying the irrepresentable condition.

Theorem 1. Suppose that data (X, Y) follows the linear model described in Equation
(1) with iid Gaussian noise $\epsilon \sim N(0, \sigma^2 I_n)$. Define the singular value decomposition of X as
$X = UDV'$. Suppose that n > p and X has rank p. We further assume that $\Lambda_{\min}(\frac{1}{n} X'X) \ge
C_{\min} > 0$. Define the Puffer Transformation, $F = UD^{-1}U^T$. Let $\tilde{X} = FX$ and
$\tilde{Y} = FY$. Define

$$\hat\beta(\lambda) = \arg\min_{b} \frac{1}{2} \|\tilde{Y} - \tilde{X}b\|_2^2 + \lambda \|b\|_1.$$

If $\min_{j \in S} |\beta_j^*| \ge 2\lambda$, then with probability greater than

$$1 - 2p \exp\left\{ -\frac{n \lambda^2 C_{\min}}{2\sigma^2} \right\},$$

$\hat\beta(\lambda) =_s \beta^*$.

Remarks. Suppose that $C_{\min} > 0$ is a constant. If p, $\min_{j \in S} |\beta_j^*|$ and $\sigma^2$ do not change
with n, then choosing $\lambda$ such that (1) $\lambda \to 0$ and (2) $\lambda^2 n \to \infty$ ensures that $\hat\beta(\lambda)$ is sign
consistent. One possible choice is $\lambda = \sqrt{\log n / n}$.



From classical linear regression, we know that increasing the correlation between columns
of X increases the variance of the standard OLS estimator; correlated predictors make
estimation more difficult. This intuition translates to preconditioning with the Lasso;
increasing the correlation between the columns of X decreases the smallest singular value of X,
increasing the variance of the noise terms $\mathrm{cov}(F\epsilon) = \sigma^2 UD^{-2}U'$, and weakening the bound
in Theorem 1. When the columns of X are correlated, then $C_{\min}$ is small and Theorem 1
gives a smaller probability of sign consistency.

Theorem 1 applies to many more sparse regression methods that use penalized least
squares. After preconditioning, the design matrix is orthogonal and several convenient
facts follow. First, if the penalty decomposes,

$$\mathrm{pen}(b, \lambda) = \sum_{j=1}^{p} \mathrm{pen}_j(b_j, \lambda)$$

so that $\mathrm{pen}_j$ does not rely on $b_k$ for $k \neq j$, then the penalized least squares method admits
a closed form solution. If it is also true that all the $\mathrm{pen}_j$'s are equivalent and have a cusp
at zero (like the Lasso penalty), then the method selects the same sequence of models
as the Lasso and correlation screening (i.e. select $X_j$ if $|\mathrm{cor}(Y, X_j)| > \lambda$) [Fan and Lv,
2008]. For example, both SCAD and MCP satisfy these conditions. If a method selects 
the same models as the Lasso, and the Lasso is sign consistent, then this method is also 
sign consistent. Thus, Theorem 1 implies that (1) preconditioned correlation screening, (2) 
preconditioned SCAD, and (3) preconditioned MCP are sign consistent under some mild 
conditions (similar to the conditions in Theorem 1). However, in high dimensions, FX is 
no longer orthogonal. So, the various methods could potentially estimate different models. 
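
For the Lasso penalty, the closed form solution under an orthonormal design is the familiar soft-thresholding rule applied to $(FX)'FY$. The sketch below assumes NumPy, n > p, and a full column rank X; the function names and the simulated, noiseless data with $\beta^* = (1, 1, 0)$ are hypothetical and chosen only to make the output easy to read.

```python
import numpy as np

def soft_threshold(z, lam):
    """Componentwise minimizer of 0.5 * (z - b)^2 + lam * |b|."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def preconditioned_lasso_low_dim(X, Y, lam):
    """Lasso on (FX, FY) for n > p: because (FX)'(FX) = I, the minimizer of
    0.5 * ||FY - FX b||_2^2 + lam * ||b||_1 is soft_threshold((FX)'FY, lam)."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    F = U @ np.diag(1.0 / d) @ U.T
    FX, FY = F @ X, F @ Y
    return soft_threshold(FX.T @ FY, lam)

rng = np.random.default_rng(3)
n = 200
z = rng.standard_normal((n, 3))
X = z.copy()
X[:, 2] = 0.6 * (z[:, 0] + z[:, 1]) + 0.5 * z[:, 2]   # third column correlated with the first two
beta_star = np.array([1.0, 1.0, 0.0])
Y = X @ beta_star                                      # noiseless response for clarity

beta_hat = preconditioned_lasso_low_dim(X, Y, lam=0.1)
print(beta_hat)
```

Since $(FX)'FY = VD^{-1}U'Y$ is the OLS estimate when X has full column rank, the preconditioned Lasso in this setting soft-thresholds the OLS coefficients. In the noiseless example the first two coefficients are shrunk by $\lambda$ and the third is set to zero, so the signs match those of $\beta^*$.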
4. High dimensional Results. The results for p > n are not as straightforward
because the columns of the design matrix cannot be orthogonal. However, the results in this
section suggest that for many high dimensional design matrices X, the matrix FX satisfies
the stringent assumptions of the Lasso. In fact, the simulation results in the following
section suggest that preconditioning offers dramatic improvements in high dimensions.

Before introducing Theorem 2, Figure 3 presents an illustrative numerical simulation
to prime our intuition on preconditioning in high dimensions. In this simulation, n =
200, p = 10,000, and each row of X is an independent Gaussian vector with mean zero
and covariance matrix $\Sigma$. The diagonal of $\Sigma$ is all ones and the off diagonal elements of $\Sigma$
are all .9. The histogram in Figure 3 includes both the distribution of pairwise correlations
between the columns of X and the distribution of pairwise correlations between the columns
of FX. Before the transformation, the pairwise correlations between the columns of X have
an average of .90 with a standard deviation of .01. After the transformation, the average
correlation is .005, and the standard deviation is .07. Figure 3 shows this massive reduction
in correlation. The histogram has two modes; the left mode corresponds to the distribution
of pairwise correlations in FX, and the right mode corresponds to the distribution of
correlations in X.
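
A scaled-down version of this experiment is easy to run. The sketch below assumes NumPy and uses the hypothetical sizes n = 50 and p = 400 rather than n = 200 and p = 10,000, so its numbers will not match those reported above; it also computes all pairwise correlations instead of sampling 10,000 pairs.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, rho = 50, 400, 0.9    # smaller than the simulation described in the text

# Rows of X ~ N(0, Sigma) with Sigma_ii = 1 and Sigma_ij = rho for i != j:
# X_ij = sqrt(rho) * g_i + sqrt(1 - rho) * z_ij, with g_i shared within row i.
g = rng.standard_normal((n, 1))
Z = rng.standard_normal((n, p))
X = np.sqrt(rho) * g + np.sqrt(1.0 - rho) * Z

# FX = U V' when F = U D^{-1} U' and X = U D V'.
U, d, Vt = np.linalg.svd(X, full_matrices=False)
FX = U @ Vt

def offdiag_correlations(M):
    """Sample correlations between all distinct pairs of columns of M."""
    C = np.corrcoef(M, rowvar=False)
    return C[np.triu_indices_from(C, k=1)]

cor_X = offdiag_correlations(X)
cor_FX = offdiag_correlations(FX)
print(f"X : mean {cor_X.mean():.3f}, sd {cor_X.std():.3f}")
print(f"FX: mean {cor_FX.mean():.3f}, sd {cor_FX.std():.3f}")
```

In this smaller setting the correlations between columns of X remain concentrated near .9, while those between columns of FX are centered near zero.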



[Figure 3: histogram titled "Preconditioning reduces the pairwise correlations"]

Fig 3. This figure displays the pairwise correlations between the columns of FX (on the left) and the pairwise
correlations between the columns of X (on the far right). In this simulation, the rows of X are iid multivariate
Gaussians with mean zero and covariance matrix $\Sigma$, with $\Sigma_{ii} = 1$ and $\Sigma_{ij} = .9$ for $i \neq j$. The figure was created by first
sampling 10,000 (i, j) pairs without replacement such that $i \neq j$ and then computing $\mathrm{cor}(FX_i, FX_j)$ and
$\mathrm{cor}(X_i, X_j)$ for each of these 10,000 pairs. Before preconditioning, the pairwise correlations are much larger.
In this setting, preconditioning reduces the pairwise correlations.



By reducing the pairwise correlations, preconditioning helps the design matrix satisfy the
irrepresentable condition. For example, if the first twenty elements of $\beta^*$ are positive and
the other 9,980 elements equal zero, then the Puffer Transformation makes the design matrix
satisfy the irrepresentable condition in this simulation. Recall that the irrepresentable
condition bounds the quantity

(9) $\left\| X(S^c)^T X(S) \left( X(S)^T X(S) \right)^{-1} \mathrm{sign}(\beta^*(S)) \right\|_\infty \le 1 - \eta$

for $\eta > 0$. From the simulated data displayed in Figure 3, the left hand side of Equation
(9) evaluated with the matrix X equals 1.09; evaluated with FX, it equals .84. In this
example, the Puffer Transformation circumvents the irrepresentable condition.

4.1. Uniform distribution on the Stiefel manifold. When n > p, the columns of FX are
orthogonal. When p > n, the rows are orthogonal. FX lies in the Stiefel manifold,

$$FX \in V(n, p) = \{V \in \mathbb{R}^{n \times p} : VV' = I_n\}.$$

Further, under any unitarily invariant norm, FX is the projection of X onto V(n,p) [Fan
and Hoffman, 1955]. Since V(n,p) is a bounded set, it has a uniform distribution.

Definition 3 (Chikuse [2003]). A random matrix V is uniformly distributed on V(n,p),
written V ~ uniform(V(n,p)), if the distribution of V is equal to the distribution of VO
for any fixed O in the orthogonal group of matrices $O(p, \mathbb{R})$, where

$$O(p, \mathbb{R}) = \{O \in \mathbb{R}^{p \times p} : OO' = I_p\}.$$

In this section, Theorem 2 shows that if V is uniformly distributed on V(n,p), then after
normalizing the columns of V to have equal length, the matrix satisfies the irrepresentable
condition with high probability in certain regimes of n, p, and s. Propositions 1 and 2 give
two examples of random design matrices X where FX is uniformly distributed on V(n,p).

Theorem 2. Suppose $X \in \mathbb{R}^{n \times p}$ is distributed uniform(V(n,p)) and $n > cs^4$, where
s is the number of relevant predictors and c is some constant, then asymptotically, the
irrepresentable condition holds for the normalized version of X with probability no less than

$$1 - 4p^2 \exp(-n \ldots$$

where the normalized version of X, denoted $\tilde{X}$, is defined as $\tilde{X}_{ij} = X_{ij} / \sqrt{\sum_{k=1}^{n} X_{kj}^2}$.

The proof of Theorem 2 relies on the fact that if $|\mathrm{cor}(X_i, X_j)| \le c/(2s - 1)$ for all
pairs $(i, j)$ with $i \neq j$, then X satisfies the irrepresentable condition. This line of argument requires
that $n > cs^4$. A similar argument applied to a design matrix X with iid N(0, 1) entries
only requires that $n > cs^2$ (this is included in Theorem 8 in the appendix Section B.2).
If both of these are tight bounds, then it suggests that the irrepresentable condition for
V ~ uniform(V(n,p)) is potentially more sensitive to the size of s than a design matrix
with iid N(0, 1) entries. The final simulation in Section 5 suggests that this difference is
potentially an artifact of our proof. Both distributions are almost equally sensitive to s. If
anything, uniform(V(n,p)) has a slight advantage in our simulation settings.

Propositions 1 and 2 give two models for X that make FX uniformly distributed on
the Stiefel manifold.

Proposition 1. If the elements of X are independent N(0,1) random variables, then
FX ~ uniform(V(n,p)).

Proof. Define U, V, D by the SVD of X, $X = UDV'$. For a fixed $O \in O(p, \mathbb{R})$, $\tilde{V}' = V'O$
is an element of V(n,p). Therefore, the SVD of XO is $UD\tilde{V}'$. This implies that both X
and XO have the same Puffer Transformation $F = UD^{-1}U'$. This yields the result when
combined with the fact that $X \stackrel{d}{=} XO$,

$$FX \stackrel{d}{=} FXO.$$

□

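Proposition 2. Suppose that $U_\Sigma \in \mathbb{R}^{p \times p}$ is drawn uniformly from the orthonormal
group of matrices $O(p, \mathbb{R})$ and $D_\Sigma \in \mathbb{R}^{p \times p}$ is a diagonal matrix with positive entries (either
fixed or random). Define $\Sigma = U_\Sigma D_\Sigma U_\Sigma'$ and suppose the rows of X are drawn independently
from $N(0, \Sigma)$, then FX ~ uniform(V(n,p)).

Proof. Notice that if $Z \in \mathbb{R}^{n \times p}$ has iid N(0, 1) elements and Z is independent of $U_\Sigma$,
then $X \stackrel{d}{=} Z\Sigma^{1/2}$. The following equalities in distribution follow from the fact that for any
$O \in O(p, \mathbb{R})$, $U_\Sigma \stackrel{d}{=} OU_\Sigma$ and $ZO \stackrel{d}{=} Z$.

$$\begin{aligned}
FX &= (XX^T)^{-1/2} X \\
&\stackrel{d}{=} (Z \Sigma Z^T)^{-1/2} Z \Sigma^{1/2} \\
&\stackrel{d}{=} (Z O U_\Sigma D_\Sigma U_\Sigma^T O^T Z^T)^{-1/2} Z O U_\Sigma D_\Sigma^{1/2} U_\Sigma^T O^T \\
&\stackrel{d}{=} (Z U_\Sigma D_\Sigma U_\Sigma^T Z^T)^{-1/2} Z U_\Sigma D_\Sigma^{1/2} U_\Sigma^T O^T \\
&\stackrel{d}{=} FXO^T
\end{aligned}$$

Since O is an arbitrary element of $O(p, \mathbb{R})$, so is $O^T$. Thus, FX ~ uniform(V(n,p)). □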

Proposition 2 presents a scenario in which the distribution of FX is independent of the
eigenvalues of $\Sigma$. Of course, if $\Sigma$ has a large condition number, then the transformation
$F = (XX^T)^{-1/2}$ could potentially induce excessive variance in the noise.
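
The identity $F = UD^{-1}U^T = (XX^T)^{-1/2}$, valid when p > n and X has rank n, can be checked numerically. The sketch below assumes NumPy; the simulated Gaussian matrix and its dimensions are hypothetical. The inverse symmetric square root is formed here from the eigendecomposition of $XX^T$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 8, 40
X = rng.standard_normal((n, p))

# Puffer Transformation from the thin SVD X = U D V'.
U, d, Vt = np.linalg.svd(X, full_matrices=False)
F_svd = U @ np.diag(1.0 / d) @ U.T

# Inverse symmetric square root of X X' from its eigendecomposition.
evals, W = np.linalg.eigh(X @ X.T)
F_gram = W @ np.diag(evals ** -0.5) @ W.T

FX = F_svd @ X
print(np.allclose(F_svd, F_gram))            # the two constructions of F agree
print(np.allclose(FX @ FX.T, np.eye(n)))     # FX lies in V(n, p): (FX)(FX)' = I_n
```

Both routes require only an n x n decomposition when p > n, which keeps the preprocessing step cheap relative to fitting the Lasso.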

In practice, one usually gets a single design matrix X and a single preconditioned matrix
$FX \in V(n,p)$. It might be difficult to argue that FX ~ uniform(V(n,p)). Instead,
one should interpret Theorem 2 as saying that nearly all matrices in V(n,p) satisfy the
irrepresentable condition.

So far, this section has illustrated how preconditioning circumvents the irrepresentable
condition in high dimensions. This next theorem assumes FX satisfies the irrepresentable
condition and studies when the preconditioned Lasso estimator is sign consistent,
highlighting when F might induce excessive noise.

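Theorem 3. Suppose that data (X, Y) follows the linear model described in Equation
(1) with iid normal noises $\epsilon \sim N(0,\sigma^2I_n)$. Define the singular value decomposition of X
as $X = UDV^T$. Suppose that $p \ge n$. We further assume that $\Lambda_{\min}(\frac{1}{n}X(S)^TX(S)) \ge \tilde C_{\min}$
and $\min_i(D_{ii}^2) \ge p\,d_{\min}$ with constants $\tilde C_{\min} > 0$ and $d_{\min} > 0$. For $F = UD^{-1}U^T$, define
$\tilde Y = FY$ and $\tilde X = FX$. Define

$$\hat\beta(\lambda) = \arg\min_b \frac{1}{2}\|\tilde Y - \tilde Xb\|_2^2 + \lambda\|b\|_1.$$

Under the following three conditions,

1. $\bigl\|\tilde X(S^c)^T\tilde X(S)\,(\tilde X(S)^T\tilde X(S))^{-1}\bigr\|_\infty \le 1 - \eta$,

2. $\Lambda_{\min}(\tilde X(S)^T\tilde X(S)) \ge \frac{n}{cp}$,

3. $\min_{j\in S}|\beta_j^*| \ge 2\lambda\sqrt{scp/n}$,

where c is some constant; we have

$$P\bigl(\hat\beta(\lambda) =_s \beta^*\bigr) > 1 - 2p\exp\left\{-\frac{p\lambda^2\eta^2d_{\min}}{2\sigma^2}\right\}. \tag{10}$$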

This theorem explicitly states when preconditioning will lead to sign consistency. Note
that $d_{\min}$ in the exponent of Equation (10) corresponds to the amount of additional noise
induced by the preconditioning. For $p \gg n$, the assumption $\min_i(D_{ii}^2) \ge p\,d_{\min}$ is often
satisfied. For example, it holds if X ~ $N(0,\Sigma)$ and the eigenvalues of $\Sigma$ are lower bounded.
To see this, define $X = Z\Sigma^{1/2}$ for $Z \in \mathbb{R}^{n\times p}$ with $Z_{ij} \sim N(0,1)$ iid. Then,

$$D_{ii}^2 \ge \min_{w\in\mathbb{R}^n,\|w\|_2=1}\|w'X\|_2^2 = \min_{w\in\mathbb{R}^n,\|w\|_2=1}\|w'Z\Sigma^{1/2}\|_2^2 \ge \min_{w\in\mathbb{R}^n,\|w\|_2=1}\|w'Z\|_2^2\,\Lambda_{\min}(\Sigma),$$

where $\Lambda_{\min}(\Sigma)$ is the smallest eigenvalue of $\Sigma$. With high probability, $\frac{1}{p}\|w'Z\|_2^2$ is bounded
below [Davidson and Szarek, 2001]. Thus, $\min_i(D_{ii}^2) \ge p\,d_{\min}$ holds for some constant $d_{\min}$
with high probability.

The enumerated conditions in Theorem 3 correspond to standard assumptions for the
Lasso to be sign consistent. Condition (1) is the irrepresentable condition applied to $\tilde X$.
Condition (2) ensures first, that the columns in $\tilde X(S)$ are not too correlated and second, that
the columns in $\tilde X(S)$ do not become too short since F rescales the lengths of the columns.
From Section B.2 and the discussion of the uniform distribution on the Stiefel manifold,
most matrices in V(n,p) satisfy condition (1) as long as $s = o(n^{1/4})$ and $p^2 = o(\exp(\sqrt{n}))$.
Similarly, most matrices satisfy condition (2); Theorem 7 in Appendix Section B.2 states
that the diagonals of $\tilde X(S)^T\tilde X(S)$ concentrate around n/p and the off-diagonals concentrate
around 0. Condition (3) in Theorem 3 ensures that the signal strength does not decay too
fast relative to $\lambda$. The next corollary chooses a sequence of $\lambda$'s.

Corollary 1. Under the conditions of Theorem 3, if $\min_{j\in S}|\beta_j^*|$ is a constant and
$n/(s\log p) \to \infty$, then setting $\lambda^2 = \sqrt{n\log(p)/(sp^2)}$ ensures sign consistency.

Theorem 3 highlights the tradeoff between (a) satisfying the irrepresentable condition
and (b) limiting the amount of additional noise created by preconditioning. Interestingly,
even if X satisfies the irrepresentable condition, the Puffer Transformation might still
improve sign estimation performance as long as the preconditioner can balance the tradeoff.
To illustrate this, presume that both X and $\tilde X$ satisfy the irrepresentable condition with
constants $\eta$ and $\tilde\eta$ respectively. Preconditioning improves the bound in Equation 10 if
$\tilde\eta^2d_{\min} > \eta^2$. Alternatively, if $d_{\min}$ is very small, then the Puffer Transformation will induce
excess noise and $\tilde\eta^2d_{\min} < \eta^2$, weakening the upper bound.

5. Simulations. The first two simulations in this section compare the model selection
performance of the preconditioned Lasso to the standard Lasso, Elastic Net, SCAD, and
MC+. The third simulation compares the uniform distribution on V(n,p) to the iid
Gaussian ensemble, estimating how often these distributions
satisfy the irrepresentable condition.

5.1. Choosing $\lambda$ with BIC. After preconditioning, the noise vector $F\epsilon$ contains statistically
dependent terms that are no longer exchangeable. This creates problems for selecting
the tuning parameter in the preconditioned Lasso. For example, if one wants to use
cross-validation, then the test set should not be preconditioned. After several attempts, we were
unable to find a cross-validation procedure that led to adequate model selection performance.
In this section, there are two sets of simulations that correspond to two ways of
choosing $\lambda$. The first set of simulations chooses $\lambda$ using a BIC procedure. To ensure that
the results are not due to the BIC procedure, the second set of simulations chooses the $\lambda$
such that it selects the first model with ten degrees of freedom.

The first set of simulations chooses $\lambda$ with the following procedure.

OLS-BIC: to choose a model in a path of models.

1) For df = 1:40 

a) starting from the null model, select the first model along the solution path with 
df degrees of freedom. 

b) use the df model to fit an OLS model. The OLS model is fit with the original 
(un-preconditioned) data. 

c) compute the BIC for the OLS model. 

end 

2) Select the tuning parameter that corresponds to the model with the lowest OLS-BIC 
score. 

The OLS models were fit with the R function lm and the BIC was computed with the 
R function BIC. 
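The OLS-BIC steps above can be sketched in NumPy as follows. This is an illustration rather than the authors' R code: the function names, the toy data, and the nested path of supports are hypothetical, and the BIC is computed only up to additive constants that are the same for every model, so its values differ from those of R's BIC while the minimizer is unchanged.

```python
import numpy as np

def ols_bic(X, y, support):
    """BIC (up to model-independent constants) of an OLS fit with intercept
    on the un-preconditioned columns listed in `support`."""
    n = len(y)
    A = np.column_stack([np.ones(n), X[:, support]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = float(np.sum((y - A @ coef) ** 2))
    return n * np.log(rss / n) + np.log(n) * A.shape[1]

def select_by_ols_bic(X, y, path_supports):
    """path_supports: one list of selected column indices per candidate df."""
    scores = [ols_bic(X, y, s) for s in path_supports]
    return int(np.argmin(scores)), scores

# Toy illustration (hypothetical data): two strong predictors, four noise columns.
rng = np.random.default_rng(0)
n, p = 100, 6
X = rng.standard_normal((n, p))
y = 10 * X[:, 0] + 10 * X[:, 1] + rng.standard_normal(n)
path = [[0], [0, 1], [0, 1, 2], [0, 1, 2, 3]]   # assumed supports along a solution path
best, scores = select_by_ols_bic(X, y, path)
```

In the procedure of this section, `path` would hold the first support with df = 1, ..., 40 along the (possibly preconditioned) solution path, while `X` and `y` passed to `ols_bic` remain the original data.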

In this simulation and that of Section 5.2, n = 250, s = 20, and p grows along the
horizontal axis of the figures (from 32 to 32,768). All nonzero elements in $\beta^*$ equal ten and
$\sigma^2 = 1$. The rows of X are mean zero Gaussian vectors with constant correlation $\rho$.

In Figure 4, there are three columns of plots and three rows of plots. The rows correspond
to different levels of correlation $\rho$; .1 in the top row, .5 in the middle row, and .85 on the
bottom row. The columns correspond to different measurements of the estimation error;
the number of false negatives on the left, the number of false positives in the middle, and
the $\ell_2$ error $\|\hat\beta(\lambda) - \beta^*\|_2$ on the right. Each data point in every figure comes from an
average of ten simulation runs. In each run, both X and Y are resampled.

In many settings, across both p and $\rho$, the preconditioned Lasso simultaneously admits
fewer false positives and fewer false negatives than the competing methods. The number
of false negatives when $\rho = .85$ (displayed in the bottom left plot of Figure 4) gives
the starkest example. However, the results are not uniformly better. For example, when
$p \approx n = 250$, the preconditioned Lasso performs poorly; this behavior around $p \approx n$
appeared in several other simulations that are not reported in this paper. Further, when
the design matrix has fewer predictors p and lower correlations $\rho$, MC+ has smaller $\ell_2$
error than the preconditioned Lasso. However, as the correlation increases or the number
of predictors grows, the preconditioned Lasso appears to outperform MC+ in $\ell_2$ error.

5.2. Ensuring that OLS-BIC gives a fair comparison. To ensure that the results in
Figure 4 are not an artifact of the OLS-BIC tuning, Figure 5 tunes each procedure by taking
the first model to contain at least ten predictors. The horizontal axes display p on the log
scale. The vertical axes report the number of false positives divided by ten. Each dot in
the figure represents an average of ten runs of the simulation, where each run resamples X
and Y.

Figure 5 shows that as p grows large, the standard Lasso, Elastic Net (enet), SCAD, and
MC+ perform in much the same way, whereas the preconditioned Lasso drastically outperforms
all of them. For the largest p in the middle and lower graphs, the preconditioned
Lasso yields two or three false positives out of ten predictors; all of the other methods have
at least eight false positives out of ten predictors. These results suggest that the results in
Figure 4 were not an artifact of the OLS-BIC tuning.

[Figure 4: three-by-three grid of plots. Top-row panel titles: "False negatives for BIC tuning; rho = .1", "False positives for BIC tuning; rho = .1", "L2 error for BIC tuning; rho = .1". Horizontal axis: p.]

Fig 4. In all figures, n = 250 and p grows along the horizontal axis. Each row of X is an independent
multivariate Gaussian with constant correlation $\rho$ in the off diagonal of the correlation matrix. The figure
above has three rows and three columns of plots. The three rows correspond to different values of $\rho$. From
top to bottom, $\rho$ is .1, .5, and .85. The three columns correspond to the number of false negatives (on left),
the number of false positives (in middle), and $\|\hat\beta(\lambda) - \beta^*\|_2$ (on right). The tuning parameter for each
method is selected with the OLS-BIC procedure described in Section 5.1. The results from the preconditioned
Lasso appear as a solid black line. Note that the number of false negatives cannot exceed s = 20. In the
plots on the left side, a dashed horizontal line at 20 represents this limit. For scale, this dashed line is also
included in the false positive plots. In the $\ell_2$ error plots, the dashed line corresponds to the $\ell_2$ error for
the estimate $\hat\beta = 0$. For both $\rho = .5$ and .85, the competing methods miss a significant fraction of the true
nonzero coefficients and only at the very end, for p > 32,000, does the preconditioned Lasso start to miss
any of the true coefficients. At the same time, the preconditioned Lasso accepts fewer false positives than
the alternative methods.

[Figure 5: panel title "False positives in first 10; rho = .1". Horizontal axis: p on the log scale.]

Fig 5. To ensure that the results in Figure 4 are not due to the choice of tuning
parameter, this figure chooses the tuning parameter in a much simpler fashion.
Each tuning parameter corresponds to the first estimated model that contains
at least ten predictors. Along the horizontal axis, p increases on the log scale.
The vertical axis displays the number of false positives divided by ten. The
line for the elastic net can exceed one because the default settings of glmnet
sometimes fail to select a model with exactly 10 predictors. So, the tuning
method selects the next biggest model. Both $\beta^*$ and $\sigma^2$ are unchanged from the
simulations displayed in Figure 4 and X comes from the same distribution. The
results in this figure are comparable to the results in Figure 4. This suggests
that the results from OLS-BIC provide a fair comparison of the methods.

The preconditioned Lasso does not perform well when p is close to n = 250. This is
potentially due to the instability of the spectrum of F when p and n are comparable.
Results in random matrix theory suggest that in this regime, $C_{\min}$ (the smallest nonzero
singular value of X) can be very close to zero [Silverstein, 1985]. When the spectral norm of
the Puffer Transform ($1/C_{\min}$) is large, for example when $p \approx n$, preconditioning can greatly
amplify the noise, leading to poor estimation performance. As previously mentioned, it is
an area for future research to investigate if a Tikhonov regularization of F might improve
the performance in this regime.

All simulations in this section were run in R with the packages lars (for the Lasso),
plus (for SCAD and MC+), and glmnet (for the elastic net). All packages were run with
their default settings.

5.3. Satisfying the irrepresentable condition on V(n,p). Theorem 2 shows that, under
certain conditions, if V ~ uniform(V(n,p)) then V satisfies the irrepresentable condition
with high probability. One of the more undesirable conditions is that it requires $n > cs^4$,
where c is a constant and s is the number of relevant predictors. As is discussed after
Theorem 2, if the design matrix contains iid N(0,1) entries, then it is only required that
$n > cs^2$. The simulation displayed in Figure 6 compares the irrepresentable condition
scores:

$$IC(X,S) = \bigl\|X(S^c)'X(S)\,(X(S)'X(S))^{-1}\mathbf{1}_s\bigr\|_\infty, \tag{11}$$

where $\mathbf{1}_s \in \mathbb{R}^s$ is a vector of ones, for X from the iid N(0,1) distribution and for X ~
uniform(V(n,p)). In these simulations, n = 200 and p = 10,000.
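The score in Equation (11) is short to compute. The sketch below is one possible NumPy rendering; the function name `ic_score` and the two small test matrices are illustrative assumptions, not part of the simulation code used for Figure 6.

```python
import numpy as np

def ic_score(X, S):
    """|| X(S^c)' X(S) (X(S)' X(S))^{-1} 1_s ||_inf for a list of column indices S."""
    S = np.asarray(S)
    Sc = np.setdiff1d(np.arange(X.shape[1]), S)
    XS, XSc = X[:, S], X[:, Sc]
    w = np.linalg.solve(XS.T @ XS, np.ones(len(S)))
    return float(np.max(np.abs(XSc.T @ XS @ w)))

# Orthonormal columns: every irrelevant column is uncorrelated with X(S), score 0.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((6, 4)))
score_orthonormal = ic_score(Q, [0, 1])

# An irrelevant column aligned with the sum of the relevant ones violates the condition.
X_bad = np.array([[1.0, 0.0, 0.9],
                  [0.0, 1.0, 0.9],
                  [0.0, 0.0, 0.0]])
score_bad = ic_score(X_bad, [0, 1])   # 0.9 + 0.9 = 1.8 > 1
```

A score below one corresponds to the irrepresentable condition holding for the all-positive sign pattern on S.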



[Figure 6: two panels. Left panel title: "Irrepresentable condition similar when n = 200, p = 10k".]

Fig 6. The left panel displays a Monte Carlo estimate of E(IC(X, S)) where IC is defined in Equation 11.
The solid line corresponds to random matrices X with iid N(0,1) entries. The dashed line corresponds to
uniform(V(n,p)). The lines are nearly identical, suggesting that uniform(V(n,p)) design matrices behave
similarly to iid Gaussian designs with respect to the irrepresentable condition. The right panel displays
the quantities $IC(X^{(i)},S) - IC(V^{(i)},S)$, where $V^{(i)}$ is the projection of $X^{(i)}$ onto V(n,p). The variability
displayed in the right panel implies that preconditioning only helps for a certain type of design matrix; it does
not always help. In this example, the columns of $X^{(i)}$ are not highly correlated and sometimes preconditioning
decreases IC(·, S) and sometimes it increases IC(·, S).



This simulation sampled 100 matrices $X^{(i)} \in \mathbb{R}^{200\times 10,000}$ with iid N(0,1) entries, with
i = 1, ..., 100. Each of these matrices $X^{(i)}$ was then projected onto V(n,p) with the
Puffer Transformation; $V^{(i)} = F^{(i)}X^{(i)}$, where $F^{(i)}$ is the Puffer Transformation for $X^{(i)}$.
By Proposition 1, $V^{(i)}$ ~ uniform(V(n,p)). Before doing any of the remaining calculations,
each matrix was centered to have mean zero and scaled to have standard deviation one.
For each of these (centered and scaled) matrices $X^{(i)}$ and $V^{(i)}$, the values IC(·, S) are
computed for S = {1, ..., q} for q = 1, ..., 30. (So, the matrices $X^{(i)}$ and $V^{(i)}$ are
recycled 30 times, once for each value of s.) The left panel of Figure 6 plots the average
IC(·, S) for each distribution, for each value of s. The solid line corresponds to the $X^{(i)}$'s,
and the dashed line corresponds to the $V^{(i)}$'s. These lines are nearly identical. This suggests
that uniform(V(n,p)) and the iid Gaussian design perform similarly with respect to the
irrepresentable condition.

The right plot in Figure 6 displays the quantities $IC(X^{(i)},S) - IC(V^{(i)},S)$ as a function
of s. The boxplots illustrate that, even between the paired samples $X^{(i)}$ and $V^{(i)}$, there
is significant variation in IC(·, S). Further, there are several cases where it is negative,
implying

$$IC(X^{(i)},S) < IC(V^{(i)},S).$$

In these cases, preconditioning makes IC(·, S) larger (in addition to making the noise terms
dependent). This suggests that only a certain type of data can benefit from preconditioning.

6. Relationship to other data analysis techniques. This final section identifies 
four statistical techniques that incidentally precondition the design matrix. Just as a good 
preconditioning matrix can improve the conditioning of X, a bad preconditioning matrix 
can worsen the conditioning of X. Any processing or preprocessing step which is equivalent 
to multiplying X by another matrix could potentially lead to an ill conditioned design ma- 
trix. It is an area for future research to assess if/how these methods affect the conditioning 
and if/how these issues cascade into estimation performance. 

Bootstrapping. In the most common form of the bootstrap, the "non-parametric" bootstrap,
one samples pairs of observations $(Y_i, X_i)$ with replacement from all n observations
[Efron and Tibshirani, 1993]. Each bootstrap sampled data set is equivalent to left
multiplying the regression equation by a diagonal matrix with a random multinomial vector
down the diagonal. This notion is generalized in the weighted bootstrap [Mason and Newton,
1992], which has been suggested for high dimensional regression [Arlot et al., 2010].
The weighted bootstrap replaces the multinomial vector of random variables with any
sequence of exchangeable random variables. In either case, such a diagonal matrix will make
the singular values of the design matrix more dispersed, which could lead to potential
problems in satisfying the irrepresentable condition. If X satisfies the irrepresentable condition
with a constant $\eta$, then a large proportion of the bootstrap samples of X might not
satisfy the irrepresentable condition. Or, if they satisfy the irrepresentable condition, it
might be with a reduced value of $\eta$, severely weakening their sign estimation performance.
El Karoui [2010] encountered a similar problem when using the bootstrap to study the
smallest eigenvalue of the sample covariance matrix.
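To make the "diagonal preconditioner" reading of the bootstrap concrete, the sketch below resamples rows and checks that the Gram matrix of the resampled design equals that of a diagonally reweighted design. One assumption to flag: the sketch places the square roots of the multinomial counts on the diagonal, which is the normalization under which the least squares criteria match exactly; sizes and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)               # hypothetical seed
n, p = 8, 3
X = rng.standard_normal((n, p))

idx = rng.integers(0, n, size=n)             # row indices sampled with replacement
counts = np.bincount(idx, minlength=n)       # multinomial counts w_1, ..., w_n
X_boot = X[idx]                              # bootstrap design matrix
W_half = np.diag(np.sqrt(counts))            # diagonal reweighting matrix (assumed sqrt form)

gram_boot = X_boot.T @ X_boot
gram_weighted = (W_half @ X).T @ (W_half @ X)

# Singular values before and after reweighting, to inspect their dispersion.
sv_original = np.linalg.svd(X, compute_uv=False)
sv_weighted = np.linalg.svd(W_half @ X, compute_uv=False)
```

Comparing `sv_original` with `sv_weighted` over repeated draws gives a quick, informal look at how the reweighting changes the spectrum of the design.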

Generalized least squares. In classical linear regression,

$$Y = X\beta^* + \epsilon$$

with $X \in \mathbb{R}^{n\times p}$ and $p \ll n$, if the error terms have expectation zero and covariance $\Sigma_\epsilon =
\sigma^2I$, then the ordinary least squares (OLS) estimator is the best linear unbiased estimator.
If the covariance of the noise vector is a known matrix $\Sigma_\epsilon$ which is not proportional to the
identity matrix, Aitken [1935] proposed what is commonly called generalized least squares
(GLS). In GLS, the regression equation is preconditioned:

$$\Sigma_\epsilon^{-1/2}Y = \Sigma_\epsilon^{-1/2}X\beta^* + \Sigma_\epsilon^{-1/2}\epsilon.$$

This ensures that the covariance of the noise term is proportional to the identity matrix,
$\mathrm{cov}(\Sigma_\epsilon^{-1/2}\epsilon) = I$. Applying OLS to the transformed equation gives a best linear unbiased
estimator.
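As an illustration of GLS as left preconditioning, the sketch below whitens a small regression with $\Sigma_\epsilon^{-1/2}$ and compares the result with the usual closed form $(X'\Sigma_\epsilon^{-1}X)^{-1}X'\Sigma_\epsilon^{-1}Y$. The AR(1)-type covariance, the coefficient vector, and all sizes are assumptions chosen only for the example.

```python
import numpy as np

rng = np.random.default_rng(2)               # hypothetical seed
n, p = 30, 3
X = rng.standard_normal((n, p))
beta_star = np.array([1.0, -2.0, 0.5])

# Assumed known noise covariance: entries 0.6^{|i-j|}.
Sigma = 0.6 ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
y = X @ beta_star + np.linalg.cholesky(Sigma) @ rng.standard_normal(n)

# Sigma^{-1/2} from the eigendecomposition of the symmetric matrix Sigma.
vals, vecs = np.linalg.eigh(Sigma)
Sigma_inv_half = vecs @ np.diag(vals ** -0.5) @ vecs.T

# Precondition both sides of the regression equation, then run OLS.
Xw, yw = Sigma_inv_half @ X, Sigma_inv_half @ y
beta_gls, *_ = np.linalg.lstsq(Xw, yw, rcond=None)

# Closed-form GLS estimator for comparison.
Sigma_inv = np.linalg.inv(Sigma)
beta_closed = np.linalg.solve(X.T @ Sigma_inv @ X, X.T @ Sigma_inv @ y)

# Conditioning of the design before and after preconditioning.
cond_before, cond_after = np.linalg.cond(X), np.linalg.cond(Xw)
```

The pair `cond_before`, `cond_after` is the quantity of interest for this section: whitening the noise can change the conditioning of the design in either direction.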

Huang et al. [2010] proposed $\ell_1$ penalized GLS for spatial regression. In effect, they
preconditioned the regression equation with $\Sigma_\epsilon^{-1/2}$. This is potentially inadvisable. The
harm from an ill conditioned design matrix potentially overwhelms the gains provided by
an uncorrelated error term. Similarly, if a data set has heteroskedastic noise, it is inadvisable
to reweight the matrix to make the noise terms homoscedastic. This was observed in
Simulation 3 of Jia et al.$^2$

Generalized linear models. In a generalized linear model, the outcome $Y_i$ is generated
from a distribution in the exponential family and $E(Y_i) = g^{-1}(x_i\beta^*)$ for some link function
g [Nelder and Wedderburn, 1972]. In the original Lasso paper, Tibshirani [1996] proposed
applying a Lasso penalty to generalized linear models. To estimate these models, Park
and Hastie [2007] and Friedman et al. [2010] both proposed iterative algorithms with
fundamental similarities to Iteratively Reweighted Least Squares. At every iteration of these
algorithms, X and Y are left multiplied by a different reweighting matrix. It is an area for
future research to examine if the sequence of reweighting matrices might lead to an ill
conditioned design and whether there are additional algorithmic steps that might ameliorate
this issue.

Generalized Lasso. Where the standard Lasso penalizes with $\|b\|_1$, the generalized Lasso
penalizes with $\|Db\|_1$ for some predefined matrix D [Tibshirani, 2011]. For example, D may
exploit some underlying geometry in the data generating mechanism. If D is invertible,
then the generalized Lasso problem

$$\min_b \|Y - Xb\|_2^2 + \lambda\|Db\|_1$$

can be transformed into a standard Lasso problem for $\theta = Db$,

$$\min_\theta \|Y - XD^{-1}\theta\|_2^2 + \lambda\|\theta\|_1.$$

In this problem, the design matrix is right multiplied by $D^{-1}$. In this paper, we only
consider left preconditioning, where the matrix X is left multiplied by the preconditioner.
However, in numerical linear algebra, there is also right preconditioning. Just as a bad left
preconditioner can make the design matrix ill conditioned, so can a bad right
preconditioner. In practice, the statistician should understand if their matrix D makes their design
ill conditioned.

2 It was the research in Jia et al. that prompted the current paper.
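The change of variables $\theta = Db$ can be checked numerically. The sketch below evaluates both objectives at matching points and records the condition number of the design before and after right multiplication by $D^{-1}$; the first-difference matrix D, the helper names, and all sizes are hypothetical choices for the example.

```python
import numpy as np

def gen_lasso_objective(y, X, b, D, lam):
    """||y - X b||_2^2 + lam * ||D b||_1"""
    return np.sum((y - X @ b) ** 2) + lam * np.sum(np.abs(D @ b))

def lasso_objective(y, X, theta, lam):
    """||y - X theta||_2^2 + lam * ||theta||_1"""
    return np.sum((y - X @ theta) ** 2) + lam * np.sum(np.abs(theta))

rng = np.random.default_rng(3)               # hypothetical seed
n, p, lam = 10, 4, 0.7
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
b = rng.standard_normal(p)

# Assumed invertible D: first differences, with the first coordinate kept as is.
D = np.eye(p) - np.eye(p, k=-1)
X_right = X @ np.linalg.inv(D)               # right-preconditioned design X D^{-1}
theta = D @ b

same_value = np.isclose(gen_lasso_objective(y, X, b, D, lam),
                        lasso_objective(y, X_right, theta, lam))
cond_before, cond_after = np.linalg.cond(X), np.linalg.cond(X_right)
```

Because the objectives agree for every b, the two problems share minimizers up to the map $\theta = Db$; comparing `cond_before` with `cond_after` shows what the choice of D does to the design.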

7. Discussion. The information technology revolution has brought on a new type of 
data, where the number of predictors, or covariates, is far greater than the number of obser- 
vations. To make statistical inferences in this regime, one needs to assume that the "truth" 
lies in some low dimensional space. This paper addresses the Lasso for sparse regression. 
To achieve sign consistency, the Lasso requires a stringent irrepresentable condition. To 
avoid the irrepresentable condition, Fan and Li [2001], Zou [2006], and Zhang [2010] suggest
alternative penalty functions, Xiong et al. [2011] proposed an EM algorithm, and others,
including Greenshtein and Ritov [2004] and Shao and Deng [2012], have described
alternative forms of consistency that do not require stringent assumptions on the design
matrix. In this paper, we show that it is not necessary to rely on nonconvex penalties, nor
is it necessary to move to alternative forms of consistency. We show that precondition- 
ing has the potential to circumvent the irrepresentable condition in several settings. This 
means that a preprocessing step can make the Lasso sign consistent. Furthermore, this 
preprocessing step is easy to implement and it is motivated by a wide body of research in 
numerical linear algebra. 

The preconditioning described in this paper left multiplies the design matrix X and the
response Y by a matrix $F = UD^{-1}U'$, where U and D come from the SVD of $X = UDV'$.
This preprocessing step makes the columns of the design matrix less correlated; while 
the original design matrix X might fail the irrepresentable condition, the new design ma- 
trix FX can satisfy it. In low dimensions, our preconditioner, the Puffer Transformation, 
ensures that the design matrix always satisfies the irrepresentable condition. In high di- 
mensions, the Puffer Transformation projects the design matrix onto the Stiefel manifold, 
and Theorem 2 shows that most matrices on the Stiefel manifold satisfy the irrepresentable 
condition. In our simulation settings, the Puffer Transformation drastically improves the 
Lasso's estimation performance, particularly in high dimensions. We believe that this opens 
the door to several other important questions (theoretical, methodological, and applied) 
on how preconditioning can aid sparse high dimensional inference. We have focused on
preconditioning and the sign consistency of the Lasso. However, preconditioning also has the
potential to benefit several other areas as well, including Basis Pursuit and the Restricted
Isometry Principle [Chen and Donoho, 1994, Candes, 2008] and forms of $\ell_2$ consistency for
the Lasso and the restricted eigenvalue condition [Bickel et al., 2009].

This is the first paper to demonstrate how preconditioning the standard linear regression
equation can circumvent the stringent irrepresentable condition. This represents a
computationally straightforward fix for the Lasso inspired by an extensive numerical linear
algebra literature. The algorithm easily extends to high dimensions and, in our simulations,
demonstrates a selection advantage and improved $\ell_2$ performance over previous techniques
in very high dimensions.

Appendix 
We prove our theorems in the appendix. 

APPENDIX A: LOW DIMENSIONAL SETTINGS 

We first give a well-known result that makes sure the Lasso exactly recovers the sparse
pattern of $\beta^*$, that is $\hat\beta(\lambda) =_s \beta^*$. The following Lemma gives necessary and sufficient
conditions for $\mathrm{sign}(\hat\beta(\lambda)) = \mathrm{sign}(\beta^*)$. Wainwright [2009] gives these conditions that follow
from KKT conditions.

Lemma 1. For linear model $Y = X\beta^* + \epsilon$, assume that the matrix $X(S)^TX(S)$ is
invertible. Then for any given $\lambda > 0$ and any noise term $\epsilon \in \mathbb{R}^n$, there exists a Lasso
estimate $\hat\beta(\lambda)$ described in Equation (4) which satisfies $\hat\beta(\lambda) =_s \beta^*$, if and only if the
following two conditions hold

$$\Bigl|X(S^c)^TX(S)(X(S)^TX(S))^{-1}\bigl[X(S)^T\epsilon - \lambda\,\mathrm{sign}(\beta^*(S))\bigr] - X(S^c)^T\epsilon\Bigr| \le \lambda, \tag{12}$$

$$\mathrm{sign}\Bigl(\beta^*(S) + (X(S)^TX(S))^{-1}\bigl[X(S)^T\epsilon - \lambda\,\mathrm{sign}(\beta^*(S))\bigr]\Bigr) = \mathrm{sign}(\beta^*(S)), \tag{13}$$

where the vector inequality and equality are taken elementwise. Moreover, if the inequality
(12) holds strictly, then

$$\hat\beta = (\hat\beta^{(1)}, 0)$$

is the unique optimal solution to the Lasso problem in Equation (4), where

$$\hat\beta^{(1)} = \beta^*(S) + (X(S)^TX(S))^{-1}\bigl[X(S)^T\epsilon - \lambda\,\mathrm{sign}(\beta^*(S))\bigr]. \tag{14}$$
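As a reading aid, the conditions of Lemma 1 can be traced back to the KKT conditions. The sketch below assumes that the Lasso objective in Equation (4) has the form $\frac{1}{2}\|Y - Xb\|_2^2 + \lambda\|b\|_1$ used in the theorems of this paper, and it writes $\hat z$ for a subgradient of the $\ell_1$ norm at $\hat\beta$; the symbol $\hat z$ is introduced here for illustration only.

```latex
% Stationarity for a solution with \hat\beta(S^c) = 0 and sign(\hat\beta(S)) = sign(\beta^*(S)):
X^T\bigl(Y - X\hat\beta\bigr) = \lambda\hat z, \qquad
\hat z(S) = \mathrm{sign}(\beta^*(S)), \qquad |\hat z_j| \le 1 \ \text{ for } j \in S^c .

% Rows indexed by S, after substituting Y = X(S)\beta^*(S) + \epsilon:
\hat\beta(S) = \beta^*(S) + \bigl(X(S)^TX(S)\bigr)^{-1}
               \bigl[X(S)^T\epsilon - \lambda\,\mathrm{sign}(\beta^*(S))\bigr].

% Rows indexed by S^c, after inserting the expression for \hat\beta(S):
\lambda\hat z(S^c) = X(S^c)^T\epsilon
   - X(S^c)^TX(S)\bigl(X(S)^TX(S)\bigr)^{-1}
     \bigl[X(S)^T\epsilon - \lambda\,\mathrm{sign}(\beta^*(S))\bigr].
```

Requiring $|\hat z_j| \le 1$ on $S^c$ gives the elementwise bound by $\lambda$ in the first condition of the lemma, and requiring the expression for $\hat\beta(S)$ to carry the signs of $\beta^*(S)$ gives the second.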



Remarks. As in Wainwright [2009], we state sufficient conditions for (12) and (13).
Define

$$\vec b = \mathrm{sign}(\beta^*(S)),$$

and denote by $e_i$ the vector with 1 in the ith position and zeroes elsewhere. Define

$$U_i = e_i^T(X(S)^TX(S))^{-1}\bigl[X(S)^T\epsilon - \lambda\vec b\bigr],$$

$$V_j = X_j^T\Bigl\{X(S)(X(S)^TX(S))^{-1}\lambda\vec b - \bigl[X(S)(X(S)^TX(S))^{-1}X(S)^T - I\bigr]\epsilon\Bigr\}.$$

By rearranging terms, it is easy to see that (12) holds strictly if and only if

$$\mathcal{M}(V) = \Bigl\{\max_{j\in S^c}|V_j| < \lambda\Bigr\} \tag{15}$$

holds. If we define $M(\beta^*) = \min_{j\in S}|\beta_j^*|$ (recall that $S = \{j : \beta_j^* \ne 0\}$ is the sparsity index),
then the event

$$\mathcal{M}(U) = \Bigl\{\max_{i\in S}|U_i| < M(\beta^*)\Bigr\} \tag{16}$$

is sufficient to guarantee that condition (13) holds.

With the previous lemma, we now have our main result for the Lasso estimator on data
from the linear model with correlated noise terms. It requires some regularity conditions.
Suppose the irrepresentable condition holds. That is, for some constant $\eta \in (0,1]$,

$$\bigl\|X(S^c)^TX(S)\,(X(S)^TX(S))^{-1}\bigr\|_\infty \le 1 - \eta. \tag{17}$$

In addition, assume

$$C_{\min} = \Lambda_{\min}\bigl(X(S)^TX(S)\bigr) > 0, \tag{18}$$

where $\Lambda_{\min}$ denotes the minimal eigenvalue, and

$$\max_j \|X_j\|_2 \le 1. \tag{19}$$



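We also need the following Gaussian comparison result for any mean zero Gaussian
random vector.

Lemma 2. For any mean zero Gaussian random vector $(X_1, \ldots, X_n)$, and t > 0, we
have

$$P\Bigl(\max_{1\le i\le n}|X_i| \ge t\Bigr) \le 2n\exp\left\{-\frac{t^2}{2\max_i E(X_i^2)}\right\}. \tag{20}$$

Define

$$\Psi(X,\beta^*,\lambda) = \lambda\left[\eta\,(C_{\min})^{-1/2} + \bigl\|(X(S)^TX(S))^{-1}\vec b\bigr\|_\infty\right].$$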



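Lemma 3. Suppose that data (X, Y) follow the linear model $Y = X\beta^* + \epsilon$, with Gaussian
noise $\epsilon \sim N(0,\Sigma_\epsilon)$. Assume that regularity conditions (17), (18), and (19) hold. If $\lambda$ satisfies

$$M(\beta^*) > \Psi(X,\beta^*,\lambda),$$

then with probability greater than

$$1 - 2p\exp\left\{-\frac{\lambda^2\eta^2}{2\Lambda_{\max}(\Sigma_\epsilon)}\right\}$$

the Lasso has a unique solution $\hat\beta(\lambda)$ with $\hat\beta(\lambda) =_s \beta^*$.

Proof. This proof is divided into two parts. First we analyze the probability of event
$\mathcal{M}(V)$, and then we analyze the event $\mathcal{M}(U)$.

Analysis of $\mathcal{M}(V)$: Note from (15) that $\mathcal{M}(V)$ holds if and only if $\max_{j\in S^c}|V_j|/\lambda < 1$.
Each random variable $V_j$ is Gaussian with mean

$$\mu_j = \lambda X_j^TX(S)(X(S)^TX(S))^{-1}\vec b.$$

Define $\tilde V_j = X_j^T\bigl[I - X(S)(X(S)^TX(S))^{-1}X(S)^T\bigr]\epsilon$, then $V_j = \mu_j + \tilde V_j$. Using the
irrepresentable condition (17), we have $|\mu_j| \le (1-\eta)\lambda$ for all $j \in S^c$, from which we obtain
that

$$\frac{1}{\lambda}\max_{j\in S^c}|\tilde V_j| < \eta \;\Longrightarrow\; \frac{\max_{j\in S^c}|V_j|}{\lambda} < 1.$$

Recall s = |S|. By the Gaussian comparison result (20) stated in Lemma 2, we have

$$P\left[\frac{1}{\lambda}\max_{j\in S^c}|\tilde V_j| \ge \eta\right] \le 2(p-s)\exp\left\{-\frac{\lambda^2\eta^2}{2\max_{j\in S^c}E(\tilde V_j^2)}\right\}.$$

Since

$$E(\tilde V_j^2) = X_j^TH\Sigma_\epsilon HX_j,$$

where $H = I - X(S)(X(S)^TX(S))^{-1}X(S)^T$ has maximum eigenvalue equal to 1,
an operator bound yields

$$E(\tilde V_j^2) \le \Lambda_{\max}(\Sigma_\epsilon)\|X_j\|_2^2 \le \Lambda_{\max}(\Sigma_\epsilon).$$

Therefore,

$$P\left[\frac{1}{\lambda}\max_j|\tilde V_j| \ge \eta\right] \le 2(p-s)\exp\left\{-\frac{\lambda^2\eta^2}{2\Lambda_{\max}(\Sigma_\epsilon)}\right\}.$$

So,

$$P\left[\frac{1}{\lambda}\max_j|V_j| < 1\right] \ge 1 - P\left[\frac{1}{\lambda}\max_j|\tilde V_j| \ge \eta\right] \ge 1 - 2(p-s)\exp\left\{-\frac{\lambda^2\eta^2}{2\Lambda_{\max}(\Sigma_\epsilon)}\right\}.$$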

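Analysis of $\mathcal{M}(U)$:

$$\max_i|U_i| \le \bigl\|(X(S)^TX(S))^{-1}X(S)^T\epsilon\bigr\|_\infty + \lambda\bigl\|(X(S)^TX(S))^{-1}\vec b\bigr\|_\infty.$$

Define $Z_i := e_i^T(X(S)^TX(S))^{-1}X(S)^T\epsilon$. Each $Z_i$ is Gaussian with mean 0 and
variance

$$\begin{aligned}
\mathrm{var}(Z_i) &= e_i^T(X(S)^TX(S))^{-1}X(S)^T\Sigma_\epsilon X(S)(X(S)^TX(S))^{-1}e_i \\
&\le \Lambda_{\max}(\Sigma_\epsilon)\,e_i^T(X(S)^TX(S))^{-1}X(S)^TX(S)(X(S)^TX(S))^{-1}e_i \\
&\le \frac{\Lambda_{\max}(\Sigma_\epsilon)}{C_{\min}}.
\end{aligned}$$

So for any t > 0, by (20)

$$P\Bigl(\max_{i\in S}|Z_i| \ge t\Bigr) \le 2s\exp\left\{-\frac{t^2C_{\min}}{2\Lambda_{\max}(\Sigma_\epsilon)}\right\};$$

by taking $t = \lambda\eta/\sqrt{C_{\min}}$, we have

$$P\left(\max_{i\in S}|Z_i| \ge \frac{\lambda\eta}{\sqrt{C_{\min}}}\right) \le 2s\exp\left\{-\frac{\lambda^2\eta^2}{2\Lambda_{\max}(\Sigma_\epsilon)}\right\}.$$

Recall the definition of $\Psi(X,\beta^*,\lambda) = \lambda\bigl[\eta\,(C_{\min})^{-1/2} + \|(X(S)^TX(S))^{-1}\vec b\|_\infty\bigr]$. We now have

$$P\Bigl(\max_i|U_i| \ge \Psi(X,\beta^*,\lambda)\Bigr) \le 2s\exp\left\{-\frac{\lambda^2\eta^2}{2\Lambda_{\max}(\Sigma_\epsilon)}\right\}.$$

By condition $M(\beta^*) > \Psi(X,\beta^*,\lambda)$, we have

$$P\Bigl(\max_i|U_i| < M(\beta^*)\Bigr) \ge 1 - 2s\exp\left\{-\frac{\lambda^2\eta^2}{2\Lambda_{\max}(\Sigma_\epsilon)}\right\}.$$

At last, we have

$$P\bigl[\mathcal{M}(V)\ \&\ \mathcal{M}(U)\bigr] \ge 1 - 2p\exp\left\{-\frac{\lambda^2\eta^2}{2\Lambda_{\max}(\Sigma_\epsilon)}\right\}.$$

□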



We are now ready to prove Theorem 1 in Section 3. For convenience, the theorem is
repeated here.

Theorem 4. Suppose that data (X, Y) follows the linear model described in Equation
(1) with iid normal noises $\epsilon \sim N(0,\sigma^2I_n)$. Define the singular value decomposition
of X as $X = UDV^T$. Suppose that $n \ge p$ and X has rank p. We further assume that
$\Lambda_{\min}(\frac{1}{n}X^TX) \ge \tilde C_{\min} > 0$. Define the Puffer Transformation, $F = UD^{-1}U^T$. Let
$\tilde X = FX$, $\tilde Y = FY$ and $\tilde\epsilon = F\epsilon$. Define

$$\hat\beta(\lambda) = \arg\min_b \frac{1}{2}\|\tilde Y - \tilde Xb\|_2^2 + \lambda\|b\|_1.$$

If $\min_{j\in S}|\beta_j^*| \ge 2\lambda$, then with probability greater than

$$1 - 2p\exp\left\{-\frac{n\lambda^2\tilde C_{\min}}{2\sigma^2}\right\}$$

we have $\hat\beta(\lambda) =_s \beta^*$.

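Proof. Data after transformation $(\tilde X, \tilde Y)$ follows the following linear model:

$$\tilde Y = \tilde X\beta^* + \tilde\epsilon,$$

with $\tilde\epsilon$ having covariance matrix $\Sigma_{\tilde\epsilon} = \sigma^2F^TF = \sigma^2UD^{-2}U^T$.
Since

$$\tilde X'\tilde X = X'F'FX = [VDU^T][UD^{-2}U^T][UDV^T] = I_{p\times p},$$

the irrepresentable condition (17) holds with $\eta = 1$. To apply Lemma 3, we first calculate
$\Psi(\tilde X,\beta^*,\lambda)$.

$$\Psi(\tilde X,\beta^*,\lambda) = \lambda\left[\frac{\eta}{\sqrt{C_{\min}}\,\max_{j\in S^c}\|\tilde X_j\|_2} + \bigl\|(\tilde X(S)^T\tilde X(S))^{-1}\vec b\bigr\|_\infty\right].$$

Notice that $\tilde X'\tilde X = I_{p\times p}$. So, $C_{\min} = 1$, $\|\tilde X_j\|_2 = 1$ and
$\|(\tilde X(S)^T\tilde X(S))^{-1}\vec b\|_\infty = 1$ and, consequently,

$$\Psi(\tilde X,\beta^*,\lambda) = 2\lambda.$$

Now we calculate the lower bound probability:

$$1 - 2p\exp\left\{-\frac{\lambda^2\eta^2}{2\Lambda_{\max}(\Sigma_{\tilde\epsilon})}\right\}.$$

Notice that $\Sigma_{\tilde\epsilon} = \sigma^2UD^{-2}U'$. So $\Lambda_{\max}(\Sigma_{\tilde\epsilon}) = \sigma^2/\min_i(D_{ii}^2)$. From $\Lambda_{\min}(\frac{1}{n}X^TX) \ge \tilde C_{\min}$, we
see that $\tilde C_{\min} \le \Lambda_{\min}(\frac{1}{n}X^TX) = \frac{1}{n}\Lambda_{\min}(VD^2V^T) = \frac{1}{n}\min_i(D_{ii}^2)$. This is to say
$\min_i D_{ii}^2 \ge n\tilde C_{\min}$, and, with $\eta = 1$,

$$\begin{aligned}
1 - 2p\exp\left\{-\frac{\lambda^2\eta^2}{2\Lambda_{\max}(\Sigma_{\tilde\epsilon})}\right\}
&= 1 - 2p\exp\left\{-\frac{\lambda^2\min_i(D_{ii}^2)}{2\sigma^2}\right\} \\
&\ge 1 - 2p\exp\left\{-\frac{n\lambda^2\tilde C_{\min}}{2\sigma^2}\right\}.
\end{aligned}$$

□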



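Next, we prove Theorem 3 in Section 4. The theorem is repeated here for convenience.

Theorem 5. Suppose that data (X, Y) follows the linear model described at (1) with
iid normal noises $\epsilon \sim N(0,\sigma^2I_n)$. Define the singular value decomposition of X as $X =
UDV^T$. Suppose that $p \ge n$. We further assume that $\Lambda_{\min}(\frac{1}{n}X(S)^TX(S)) \ge \tilde C_{\min}$ and
$\min_i(D_{ii}^2) \ge p\,d_{\min}$ with constants $\tilde C_{\min} > 0$ and $d_{\min} > 0$. For $F = UD^{-1}U^T$, define
$\tilde Y = FY$ and $\tilde X = FX$. Define

$$\hat\beta(\lambda) = \arg\min_b \frac{1}{2}\|\tilde Y - \tilde Xb\|_2^2 + \lambda\|b\|_1.$$

Under the following three conditions,

1. $\bigl\|\tilde X(S^c)^T\tilde X(S)\,(\tilde X(S)^T\tilde X(S))^{-1}\bigr\|_\infty \le 1 - \eta$,

2. $\Lambda_{\min}(\tilde X(S)^T\tilde X(S)) \ge \frac{n}{cp}$,

3. $\min_{j\in S}|\beta_j^*| \ge 2\lambda\sqrt{scp/n}$,

where c is some constant; we have

$$P\bigl(\hat\beta(\lambda) =_s \beta^*\bigr) > 1 - 2p\exp\left\{-\frac{p\lambda^2\eta^2d_{\min}}{2\sigma^2}\right\}. \tag{21}$$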



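Proof. After transformation, $(\tilde X, \tilde Y)$ follows the following linear model:

$$\tilde Y = \tilde X\beta^* + \tilde\epsilon,$$

with $\tilde\epsilon$ having covariance matrix $\Sigma_{\tilde\epsilon} = \sigma^2F^TF = \sigma^2UD^{-2}U^T$.
To apply Lemma 3, first calculate $\Psi(\tilde X,\beta^*,\lambda)$.

$$\Psi(\tilde X,\beta^*,\lambda) = \lambda\left[\eta\,(C_{\min})^{-1/2} + \bigl\|(\tilde X(S)^T\tilde X(S))^{-1}\vec b\bigr\|_\infty\right],$$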
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where C mm = A m i n (x(S) X(S)j. By condition (3), we have C m i n > 



cp 



X{S) X(S)) b 



-l 



< 



and, consequently, 

*(x,/r,A) 



V 



< A 



\/Cmin 

Vx/cp 



+ 



V ^min 



x(sfx(s)) 



+ y/scp/ 



in. 



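Now we calculate the lower bound probability:

$$1 - 2p\exp\left\{-\frac{\lambda^2\eta^2}{2\Lambda_{\max}(\Sigma_{\tilde\epsilon})}\right\}.$$

Notice that $\Sigma_{\tilde\epsilon} = \sigma^2UD^{-2}U'$. So $\Lambda_{\max}(\Sigma_{\tilde\epsilon}) = \sigma^2/\min_i(D_{ii}^2)$, and since $\min_i(D_{ii}^2) \ge p\,d_{\min}$,

$$\begin{aligned}
1 - 2p\exp\left\{-\frac{\lambda^2\eta^2}{2\Lambda_{\max}(\Sigma_{\tilde\epsilon})}\right\}
&= 1 - 2p\exp\left\{-\frac{\lambda^2\eta^2\min_i(D_{ii}^2)}{2\sigma^2}\right\} \\
&\ge 1 - 2p\exp\left\{-\frac{p\lambda^2\eta^2d_{\min}}{2\sigma^2}\right\}.
\end{aligned}$$

□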

APPENDIX B: UNIFORM DISTRIBUTION ON THE STIEFEL MANIFOLD 
To prove Theorem 2, we first present some results related to Beta distributions. 

B.1. Beta distribution. The density function for the Beta distribution with shape
parameters $\alpha > 0$ and $\beta > 0$ is

$$f(x) = c\,x^{\alpha-1}(1-x)^{\beta-1},$$

for $x \in (0,1)$, where c is a normalization constant. If X follows a Beta distribution with parameters
$(\alpha,\beta)$, it is denoted $X \sim \mathrm{Beta}(\alpha,\beta)$.

Proposition 3. If $X \sim \mathrm{Beta}(\alpha,\beta)$, then

$$E(X) = \frac{\alpha}{\alpha+\beta} \quad\text{and}\quad \mathrm{var}(X) = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}.$$

The next two inequalities for the $\chi^2$ distribution can be found in Laurent and Massart [2000] (pp. 1352).

Lemma 4. Let $X$ follow a $\chi^2$ distribution with $n$ degrees of freedom. Then, for any positive $x$, we have

$$P\left(X - n \ge 2\sqrt{nx} + 2x\right) \le \exp(-x),$$
$$P\left(X - n \le -2\sqrt{nx}\right) \le \exp(-x).$$
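The two tail bounds can be compared with simulated frequencies. The sketch below uses only the standard library and draws $\chi^2(n)$ variables as Gamma variables with shape $n/2$ and scale 2; the values of `n`, `x`, the seed, and the number of trials are arbitrary illustrative choices.

```python
import math
import random

def chi2_draw(n, rng):
    """Draw from chi-square(n), i.e. Gamma(shape=n/2, scale=2)."""
    return rng.gammavariate(n / 2.0, 2.0)

rng = random.Random(1)
n, x, trials = 50, 2.0, 20_000

upper_threshold = n + 2.0 * math.sqrt(n * x) + 2.0 * x   # upper-tail cutoff
lower_threshold = n - 2.0 * math.sqrt(n * x)             # lower-tail cutoff
samples = [chi2_draw(n, rng) for _ in range(trials)]

upper_freq = sum(v >= upper_threshold for v in samples) / trials
lower_freq = sum(v <= lower_threshold for v in samples) / trials
bound = math.exp(-x)   # right-hand side of both inequalities
```

In this setting both empirical frequencies fall well below `bound`, which is consistent with the inequalities being valid but not tight.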

By taking $x = \sqrt{n}$, from the above inequalities we immediately have

Corollary 2. Let $X$ follow a $\chi^2$ distribution with $n$ degrees of freedom. There exists a constant $c$ such that for a large enough $n$,

$$P\left(|X - n| > c\,n^{3/4}\right) \le \exp\left(-n^{1/2}\right).$$

Corollary 2 says that $\chi^2(n) = n + o(n)$ with high probability. These large deviation results can give concentration inequalities for Beta distributions, because a Beta distribution can be expressed via $\chi^2$ distributions.
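The step from Lemma 4 to Corollary 2 can be written out as follows. This is a sketch of one possible choice of constants, not necessarily the one the authors had in mind: substituting $x = \sqrt{n}$ bounds each tail by $\exp(-n^{1/2})$, and the slightly larger choice $x = 2\sqrt{n}$ is one way to absorb the factor 2 that arises from combining the two tails.

```latex
% Lemma 4 with x = \sqrt{n}, for n \ge 1:
\begin{align*}
2\sqrt{n\sqrt{n}} + 2\sqrt{n} &= 2n^{3/4} + 2n^{1/2} \le 4n^{3/4}, \\
P\left(X - n \ge 4n^{3/4}\right) &\le \exp(-n^{1/2}), \qquad
P\left(X - n \le -2n^{3/4}\right) \le \exp(-n^{1/2}), \\
P\left(|X - n| \ge 4n^{3/4}\right) &\le 2\exp(-n^{1/2}).
\end{align*}
% Lemma 4 with x = 2\sqrt{n}, for n \ge 1:
\begin{align*}
2\sqrt{2n\sqrt{n}} + 4\sqrt{n} &= 2\sqrt{2}\,n^{3/4} + 4n^{1/2} \le 7n^{3/4}, \\
P\left(|X - n| \ge 7n^{3/4}\right) &\le 2\exp(-2n^{1/2}) \le \exp(-n^{1/2})
\quad \text{whenever } \exp(-n^{1/2}) \le \tfrac{1}{2}.
\end{align*}
```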

Lemma 5. Suppose that $X \sim \chi^2(\alpha)$, $Y \sim \chi^2(\beta)$ and $X$ is independent of $Y$. Then

$$\frac{X}{X+Y} \sim \mathrm{Beta}\left(\frac{\alpha}{2}, \frac{\beta}{2}\right).$$
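The relationship in Lemma 5 can be illustrated by simulation: ratios of independent $\chi^2$ draws should match the first two moments of the corresponding Beta distribution. The sketch below uses only the standard library; the degrees of freedom, seed, and sample size are arbitrary.

```python
import random
import statistics

rng = random.Random(2)
alpha, beta = 6, 10          # degrees of freedom of the two chi-square variables
ratios = []
for _ in range(50_000):
    x = rng.gammavariate(alpha / 2.0, 2.0)   # chi-square(alpha)
    y = rng.gammavariate(beta / 2.0, 2.0)    # chi-square(beta), independent of x
    ratios.append(x / (x + y))

# Moments of Beta(alpha/2, beta/2) from the closed-form expressions.
a, b = alpha / 2.0, beta / 2.0
theo_mean = a / (a + b)
theo_var = a * b / ((a + b) ** 2 * (a + b + 1))

emp_mean = statistics.fmean(ratios)
emp_var = statistics.pvariance(ratios)
```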

With the relationship between a Beta distribution and $\chi^2$ distributions established, we obtain the following inequalities.

Theorem 6. Suppose $Z \sim \mathrm{Beta}\left(\frac{n}{2}, \frac{m}{2}\right)$. When both $m$ and $n$ are big enough, there exists some constant $c$ such that

$$P\left(Z > \frac{n + c\,n^{3/4}}{m + n - c(m+n)^{3/4}}\right) \le \exp\{-n^{1/2}\} + \exp\{-(m+n)^{1/2}\},$$

$$P\left(Z < \frac{n - c\,n^{3/4}}{m + n + c(m+n)^{3/4}}\right) \le \exp\{-n^{1/2}\} + \exp\{-(m+n)^{1/2}\}.$$

Proof. Let $X \sim \chi^2(n)$ and $Y \sim \chi^2(m)$ be independent. Then $\frac{X}{X+Y}$ has the same distribution as $Z$.

By Corollary 2, we have

$$P\left(|X - n| > c\,n^{3/4}\right) \le \exp\{-n^{1/2}\},$$
$$P\left(|X + Y - (m+n)| > c(m+n)^{3/4}\right) \le \exp\{-(m+n)^{1/2}\}.$$

If $\frac{X}{X+Y} > \frac{n + c\,n^{3/4}}{m + n - c(m+n)^{3/4}}$, then $X > n + c\,n^{3/4}$ or $X + Y < m + n - c(m+n)^{3/4}$. So

$$P\left(\frac{X}{X+Y} > \frac{n + c\,n^{3/4}}{m + n - c(m+n)^{3/4}}\right) \le P\left(X > n + c\,n^{3/4}\right) + P\left(X + Y < m + n - c(m+n)^{3/4}\right) \le \exp\{-n^{1/2}\} + \exp\{-(m+n)^{1/2}\}.$$

In the same way, we have

$$P\left(\frac{X}{X+Y} < \frac{n - c\,n^{3/4}}{m + n + c(m+n)^{3/4}}\right) \le \exp\{-n^{1/2}\} + \exp\{-(m+n)^{1/2}\}.$$

□



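Corollary 3. Suppose $Z \sim \mathrm{Beta}\left(\frac{n}{2}, \frac{m}{2}\right)$ with $m > n$ and $n/m \to c_3$ $(0 < c_3 < 1)$. There exists a constant $c_1$ such that the following inequality holds when both $m$ and $n$ are big enough,

$$P\left(\left|Z - \frac{n}{m+n}\right| > \frac{c_1 n^{3/4}}{m+n}\right) \le 2\exp\{-n^{1/2}\}.$$

Corollary 3 states that if $Z \sim \mathrm{Beta}\left(\frac{n}{2}, \frac{m}{2}\right)$ (with $m > n$), then $Z = \frac{n}{m+n} + O\left(\frac{n^{3/4}}{m+n}\right)$ with high probability.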
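The scale of the deviation in Corollary 3 can be explored by simulation. The sketch below, using only the standard library, measures the largest deviation of $\mathrm{Beta}(n/2, m/2)$ draws from $n/(m+n)$ in units of $n^{3/4}/(m+n)$; the function name, sample sizes, and seed are illustrative assumptions.

```python
import random

def max_scaled_deviation(n, m, trials, rng):
    """Largest |Z - n/(m+n)| over the draws, in units of n^(3/4)/(m+n)."""
    center = n / (m + n)
    scale = n ** 0.75 / (m + n)
    return max(
        abs(rng.betavariate(n / 2.0, m / 2.0) - center) / scale
        for _ in range(trials)
    )

rng = random.Random(3)
worst = max_scaled_deviation(n=400, m=1600, trials=2000, rng=rng)
```

A value of `worst` that stays of order one is what the corollary leads one to expect, since deviations larger than a constant multiple of $n^{3/4}/(m+n)$ should be rare.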

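Proof. From Theorem 6, we only have to prove that, for any constant $c_0$,

$$\left(\frac{n - c_0 n^{3/4}}{m + n + c_0(m+n)^{3/4}} - \frac{n}{m+n}\right) \Big/ \left(\frac{n^{3/4}}{m+n}\right) \to -c_2,$$

where $c_2 \in \mathbb{R}$ is a constant with the same sign as $c_0$. Indeed,

$$\begin{aligned}
&\left(\frac{n - c_0 n^{3/4}}{m + n + c_0(m+n)^{3/4}} - \frac{n}{m+n}\right) \Big/ \left(\frac{n^{3/4}}{m+n}\right) \\
&\quad= \frac{-c_0 n^{3/4}(m+n) - c_0 n(m+n)^{3/4}}{\left[m + n + c_0(m+n)^{3/4}\right](m+n)} \cdot \frac{m+n}{n^{3/4}} \\
&\quad= \frac{-c_0 n^{3/4}(m+n) - c_0 n(m+n)^{3/4}}{\left[m + n + c_0(m+n)^{3/4}\right]} \cdot \frac{1}{n^{3/4}} \\
&\quad= \frac{-c_0 - c_0 n^{1/4}(m+n)^{-1/4}}{1 + c_0(m+n)^{-1/4}} \\
&\quad\to -c_0 - c_0\left(\frac{c_3}{1 + c_3}\right)^{1/4} \quad \text{as } m, n \to \infty,
\end{aligned}$$

so that $c_2 = c_0\left[1 + \left(\frac{c_3}{1+c_3}\right)^{1/4}\right]$.

□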



With the previous results, we can prove results for a random vector uniformly distributed 
on the Stiefel manifold. 

B.2. Stiefel manifold. Suppose that $V \in \mathbb{R}^{n \times p}$ with $p > n$ satisfies $VV' = I_n$, that is, the rows of $V$ are orthogonal with unit length. All of these matrices $V$ form $\mathcal{V}(n,p)$, called the Stiefel manifold [Downs, 1972]. We seek to examine the properties of a matrix $V$ that is uniformly distributed on $\mathcal{V}(n,p)$. Specifically, we show that any two columns of $V$ are nearly orthogonal.

We suppose that $V$ is drawn uniformly from the Stiefel manifold. Let $X = [V_j, V_k] \in \mathbb{R}^{n \times 2}$ be two columns of $V$. Then, from Khatri [1970], if $p > n + 1$, the marginal density of $X$ is

$$c\,|I_n - XX'|^{(p-2-n-1)/2},$$

where $c$ is a normalization constant. The density of $X'X = A \in \mathbb{R}^{2 \times 2}$ is given by

$$c\,|A|^{(n-3)/2}\,|I_2 - A|^{(p-n-3)/2}, \quad 0 < A < I,$$

where $c$ is a normalization constant. $A$ is distributed as multivariate Beta of type I. From Khatri and Pillai [1965],

$$E(A) = \begin{pmatrix} n/p & 0 \\ 0 & n/p \end{pmatrix}.$$

This result shows that any two columns of $V$ are orthogonal in expectation.
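A matrix with orthonormal rows can be simulated with the standard library alone. The sketch below assumes the usual construction of a uniform draw from the Stiefel manifold by orthonormalizing iid Gaussian rows with Gram-Schmidt; this construction is not spelled out in the text, and the function name and sizes are hypothetical.

```python
import math
import random

def sample_stiefel_rows(n, p, rng):
    """Return an n x p list of rows with V V' = I_n, built by Gram-Schmidt
    on iid N(0, 1) rows (assumed construction of a uniform draw)."""
    rows = []
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(p)]
        for u in rows:
            # Remove the component of v along each previously built row.
            proj = sum(a * b for a, b in zip(v, u))
            v = [a - proj * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        rows.append([a / norm for a in v])
    return rows

rng = random.Random(4)
n, p = 10, 40
V = sample_stiefel_rows(n, p, rng)

# Squared column norms ||V_j||^2; their average is exactly n/p because
# trace(V'V) = trace(VV') = n.
col_sq_norms = [sum(V[i][j] ** 2 for i in range(n)) for j in range(p)]
mean_sq_norm = sum(col_sq_norms) / p
```

The average squared column norm equals $n/p$ for every such $V$, in line with the diagonal of $E(A)$; the individual columns fluctuate around that value.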

In Mitra [1970], a symmetric matrix $U \sim B_k\left(\frac{n_1}{2}, \frac{n_2}{2}\right)$ with $\min(n_1, n_2) > k$ has the density

$$f(U) = c\,|U|^{(n_1-k-1)/2}\,|I - U|^{(n_2-k-1)/2}.$$

So, $A$ defined above follows $B_2\left(\frac{n}{2}, \frac{p-n}{2}\right)$. From Mitra (1970), we have the following results.

Lemma 6 (Mitra (1970)). If $U \sim B_k\left(\frac{n_1}{2}, \frac{n_2}{2}\right)$, then for each fixed non-null vector $a$,

$$\frac{a'Ua}{a'a} \sim \mathrm{Beta}\left(\frac{n_1}{2}, \frac{n_2}{2}\right).$$

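By taking $a = (1, 0)'$, $(0, 1)'$ and $(1, 1)'$ respectively, we have the following results:

Corollary 4. If $A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \sim B_2\left(\frac{n}{2}, \frac{p-n}{2}\right)$, we have

1. $A_{11} \sim \mathrm{Beta}\left(\frac{n}{2}, \frac{p-n}{2}\right)$,
2. $A_{22} \sim \mathrm{Beta}\left(\frac{n}{2}, \frac{p-n}{2}\right)$,
3. $\frac{1}{2}A_{11} + \frac{1}{2}A_{22} + A_{12} \sim \mathrm{Beta}\left(\frac{n}{2}, \frac{p-n}{2}\right)$.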
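The third statement rests on a short computation, sketched here for a vector $a$ proportional to $(1, 1)'$ (any nonzero multiple gives the same ratio) and using the symmetry $A_{12} = A_{21}$. The last line shows why the three Beta variables together control the off-diagonal entry $A_{12}$.

```latex
\begin{align*}
a'Aa &= A_{11} + A_{12} + A_{21} + A_{22} = A_{11} + A_{22} + 2A_{12},
\qquad a'a = 2, \\
\frac{a'Aa}{a'a} &= \frac{A_{11} + A_{22}}{2} + A_{12}, \\
A_{12} &= \left(\frac{A_{11} + A_{22}}{2} + A_{12}\right) - \frac{A_{11}}{2} - \frac{A_{22}}{2}.
\end{align*}
```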

Now, the concentration results in Corollary 3 can bound $A_{11}$, $A_{12}$, and $A_{22}$. This yields an inequality to bound $\frac{|V_j'V_k|}{\|V_j\|_2\|V_k\|_2}$, which describes the linear relationship between two columns $V_j$ and $V_k$.

Theorem 7. Suppose that $V \in \mathbb{R}^{n \times p}$ with $p \gg n$ is drawn uniformly from the Stiefel manifold. For large enough $n$ and $p$, there exist some constants $c_1$ and $c_2$ such that, for any two different columns $V_j$ and $V_k$ of $V$, the following results hold:

(1) $P\left(|V_j'V_k| > \frac{2c_1 n^{3/4}}{p}\right) \le 6\exp\{-n^{1/2}\}$, and

(2) $P\left(\frac{|V_j'V_k|}{\|V_j\|_2\|V_k\|_2} > c_2 n^{-1/4}\right) \le 8\exp\{-n^{1/2}\}$.

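Proof. We first prove (1). From Corollary 3, for large enough $n$ and $p$,

$$P\left(\left|A_{11} - \frac{n}{p}\right| > \frac{c_1 n^{3/4}}{p}\right) \le 2\exp\{-n^{1/2}\},$$

$$P\left(\left|A_{22} - \frac{n}{p}\right| > \frac{c_1 n^{3/4}}{p}\right) \le 2\exp\{-n^{1/2}\}, \text{ and}$$

$$P\left(\left|\frac{A_{11} + A_{22}}{2} + A_{12} - \frac{n}{p}\right| > \frac{c_1 n^{3/4}}{p}\right) \le 2\exp\{-n^{1/2}\}.$$

Since

$$\begin{aligned}
|A_{12}| &\le \left|\frac{A_{11} + A_{22}}{2} + A_{12} - \frac{n}{p}\right| + \left|\frac{A_{11} + A_{22}}{2} - \frac{n}{p}\right| \\
&\le \left|\frac{A_{11} + A_{22}}{2} + A_{12} - \frac{n}{p}\right| + \left|\frac{A_{11}}{2} - \frac{n}{2p}\right| + \left|\frac{A_{22}}{2} - \frac{n}{2p}\right|,
\end{aligned}$$

it implies that, if $|A_{12}| > \frac{2c_1 n^{3/4}}{p}$, then

$$\left|A_{11} - \frac{n}{p}\right| > \frac{c_1 n^{3/4}}{p} \quad \text{or} \quad \left|A_{22} - \frac{n}{p}\right| > \frac{c_1 n^{3/4}}{p} \quad \text{or} \quad \left|\frac{A_{11} + A_{22}}{2} + A_{12} - \frac{n}{p}\right| > \frac{c_1 n^{3/4}}{p}.$$

So,

$$\begin{aligned}
P\left(|A_{12}| > \frac{2c_1 n^{3/4}}{p}\right) &\le P\left(\left|\frac{A_{11} + A_{22}}{2} + A_{12} - \frac{n}{p}\right| > \frac{c_1 n^{3/4}}{p}\right) + P\left(\left|A_{11} - \frac{n}{p}\right| > \frac{c_1 n^{3/4}}{p}\right) \\
&\quad + P\left(\left|A_{22} - \frac{n}{p}\right| > \frac{c_1 n^{3/4}}{p}\right) \\
&\le 6\exp\{-n^{1/2}\}.
\end{aligned}$$

Since $A_{12} = V_j'V_k$, this is (1).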



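Now we prove (2). If

$$\frac{|V_j'V_k|}{\|V_j\|_2\|V_k\|_2} > \frac{2c_1 n^{3/4}/p}{n/p - c_1 n^{3/4}/p},$$

we must have $|V_j'V_k| > \frac{2c_1 n^{3/4}}{p}$ or $\|V_j\|_2^2 < \frac{n}{p} - \frac{c_1 n^{3/4}}{p}$ or $\|V_k\|_2^2 < \frac{n}{p} - \frac{c_1 n^{3/4}}{p}$. So,

$$\begin{aligned}
P\left(\frac{|V_j'V_k|}{\|V_j\|_2\|V_k\|_2} > \frac{2c_1 n^{3/4}/p}{n/p - c_1 n^{3/4}/p}\right) &\le P\left(|V_j'V_k| > \frac{2c_1 n^{3/4}}{p}\right) + P\left(\|V_j\|_2^2 < \frac{n}{p} - \frac{c_1 n^{3/4}}{p}\right) \\
&\quad + P\left(\|V_k\|_2^2 < \frac{n}{p} - \frac{c_1 n^{3/4}}{p}\right) \\
&\le 8\exp\{-n^{1/2}\}.
\end{aligned}$$

Note that

$$\frac{2c_1 n^{3/4}/p}{n/p - c_1 n^{3/4}/p} = O\left(n^{-1/4}\right).$$

So, for large enough $n$, there exists a constant $c_2$ such that

$$P\left(\frac{|V_j'V_k|}{\|V_j\|_2\|V_k\|_2} > c_2 n^{-1/4}\right) \le 8\exp\{-n^{1/2}\}.$$

□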



Corollary 5. Suppose that $V \in \mathbb{R}^{n \times p}$ is uniformly distributed on the Stiefel manifold. After normalizing the columns of $V$ to have equal length in $\ell_2$, the irrepresentable condition holds for $s$ relevant variables with probability no less than

$$1 - 4p(p-1)e^{-n^{1/2}},$$

as long as $n$ is large enough and $n > (2c_2)^4(2s-1)^4$.
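The two quantities in the corollary are easy to evaluate. The sketch below is illustrative only: the function names are hypothetical, and the value passed for the constant $c_2$ is an assumption, since the text establishes only that such a constant exists.

```python
import math

def corollary5_bound(n, p):
    """Evaluate the probability lower bound 1 - 4 p (p - 1) exp(-sqrt(n))."""
    return 1.0 - 4.0 * p * (p - 1) * math.exp(-math.sqrt(n))

def min_n_for_condition(c2, s):
    """Evaluate the threshold (2 c2)^4 (2 s - 1)^4 that n must exceed."""
    return (2.0 * c2) ** 4 * (2 * s - 1) ** 4

# Hypothetical values: c2 is not given numerically in the text.
threshold = min_n_for_condition(c2=0.5, s=2)
bound_large_n = corollary5_bound(n=10_000, p=100)
bound_small_n = corollary5_bound(n=100, p=100)
```

The bound is vacuous (negative) unless $\sqrt{n}$ exceeds $\log(4p(p-1))$, and the sample-size threshold grows like $s^4$.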

Proof. Let $C$ be the normalized Gram matrix of $V$ defined as

$$C_{jk} = \frac{|V_j'V_k|}{\|V_j\|_2\|V_k\|_2}.$$

From Theorem 7,

$$P\left(|C_{jk}| > c_2 n^{-1/4}\right) \le 8\exp\{-n^{1/2}\}.$$

Using the union bound over the $p(p-1)/2$ pairs of columns,

$$P\left(\max_{j \ne k}|C_{jk}| > c_2 n^{-1/4}\right) \le 4p(p-1)\exp\{-n^{1/2}\}.$$

By Corollary 2 of Zhao and Yu (2006), when $\max_{j \ne k}|C_{jk}| \le \frac{c}{2s-1}$ for some $0 \le c < 1$, the irrepresentable condition holds. So, if $c_2 n^{-1/4} < \frac{1}{2(2s-1)}$, that is, $n > (2c_2)^4(2s-1)^4$, we have

$$P\left(\max_{j \ne k}|C_{jk}| \le \frac{1}{2(2s-1)}\right) \ge 1 - 4p(p-1)\exp\{-n^{1/2}\}.$$

□



Theorem 8. Suppose that $X \in \mathbb{R}^{n \times p}$ is a random matrix with each element $X_{ij}$ drawn iid from $N(0, 1)$. Then, after normalizing the columns of $X$ to have equal length in $\ell_2$, the irrepresentable condition holds for $s$ relevant variables with probability no less than

$$1 - \frac{p(p-1)}{2}\,e^{-\frac{nc^2}{16(2s-1)^2}} - 3p(p-1)\,e^{-\frac{n}{16}}$$

for any $0 < c < 1$.

This implies that $n$ must grow faster than $s^2\log(p)$ for $X$ to satisfy the irrepresentable condition.

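Proof. Let $C$ be the empirical correlation matrix of $X$,

$$C_{jk} = \frac{\frac{1}{n}\sum_{\ell=1}^n X_{j\ell}X_{k\ell}}{\sqrt{\frac{1}{n}\sum_{\ell=1}^n X_{j\ell}^2 \times \frac{1}{n}\sum_{\ell=1}^n X_{k\ell}^2}}.$$

By Corollary 2 of Zhao and Yu (2006), when $\max_{j \ne k}|C_{jk}| \le \frac{c}{2s-1}$, the irrepresentable condition holds. Now we try to bound $P\left(\max_{j \ne k}|C_{jk}| \le \frac{c}{2s-1}\right)$.

Note that

$$\frac{1}{n}\sum_{\ell=1}^n X_{j\ell}X_{k\ell} \,\Big|\, X_k \sim N\left(0, \frac{1}{n^2}\sum_{\ell=1}^n X_{k\ell}^2\right).$$

By a concentration inequality on the $\chi^2$ distribution,

$$P\left(\frac{1}{2} \le \frac{1}{n}\sum_{\ell=1}^n X_{k\ell}^2 \le 2\right) \ge 1 - 2e^{-\frac{n}{16}}.$$

For $Z \sim N(0, 2/n)$,

$$P\left(\frac{1}{n}\sum_{\ell=1}^n X_{j\ell}X_{k\ell} > t \,\Big|\, \frac{1}{n}\sum_{\ell=1}^n X_{k\ell}^2 \le 2\right) \le P(Z > t) \le e^{-\frac{nt^2}{4}}.$$

This holds because the variance increases. So,

$$\begin{aligned}
P\left(\frac{1}{n}\sum_{\ell} X_{j\ell}X_{k\ell} > t\right) &= P\left(\frac{1}{n}\sum_{\ell} X_{j\ell}X_{k\ell} > t \,\Big|\, \frac{1}{n}\sum_{\ell} X_{k\ell}^2 \le 2\right)P\left(\frac{1}{n}\sum_{\ell} X_{k\ell}^2 \le 2\right) \\
&\quad + P\left(\frac{1}{n}\sum_{\ell} X_{j\ell}X_{k\ell} > t \,\Big|\, \frac{1}{n}\sum_{\ell} X_{k\ell}^2 > 2\right)P\left(\frac{1}{n}\sum_{\ell} X_{k\ell}^2 > 2\right) \\
&\le P\left(\frac{1}{n}\sum_{\ell} X_{j\ell}X_{k\ell} > t \,\Big|\, \frac{1}{n}\sum_{\ell} X_{k\ell}^2 \le 2\right) + P\left(\frac{1}{n}\sum_{\ell} X_{k\ell}^2 > 2\right) \\
&\le e^{-\frac{nt^2}{4}} + 2e^{-\frac{n}{16}}.
\end{aligned}$$

Finally,

$$\begin{aligned}
&P\left(\frac{1}{n}\sum_{\ell=1}^n X_{j\ell}X_{k\ell} < a/2,\ \frac{1}{n}\sum_{\ell=1}^n X_{j\ell}^2 > 1/2,\ \frac{1}{n}\sum_{\ell=1}^n X_{k\ell}^2 > 1/2\right) \\
&\quad\ge P\left(\frac{1}{n}\sum_{\ell=1}^n X_{j\ell}X_{k\ell} < a/2\right) + P\left(\frac{1}{n}\sum_{\ell=1}^n X_{j\ell}^2 > 1/2\right) + P\left(\frac{1}{n}\sum_{\ell=1}^n X_{k\ell}^2 > 1/2\right) - 2 \\
&\quad\ge 1 - e^{-\frac{na^2}{16}} - 2e^{-\frac{n}{16}} + \left[1 - 2e^{-\frac{n}{16}}\right] \times 2 - 2 \\
&\quad= 1 - e^{-\frac{na^2}{16}} - 6e^{-\frac{n}{16}}.
\end{aligned}$$

Taking $a = \frac{c}{2s-1}$,

$$P\left(|C_{jk}| < \frac{c}{2s-1}\right) \ge 1 - e^{-\frac{nc^2}{16(2s-1)^2}} - 6e^{-\frac{n}{16}}$$

and, by the union bound over the $p(p-1)/2$ pairs,

$$P\left(\max_{j \ne k}|C_{jk}| < \frac{c}{2s-1}\right) \ge 1 - \frac{p(p-1)}{2}\,e^{-\frac{nc^2}{16(2s-1)^2}} - 3p(p-1)\,e^{-\frac{n}{16}}.$$

□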
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